
Fix benchmark exploit via object-identity caching #102

Open

msaroufim wants to merge 2 commits into main from fix-benchmark-object-identity-exploit

Conversation

@msaroufim (Member) commented Feb 7, 2026

Summary

  • Fixes a vulnerability where submissions could cache results based on Python object identity (id(tensor))

Changes

  1. Clone data before each timing iteration (outside the timed region): gives each iteration fresh object identities without affecting measured kernel time
  2. Use a local seed variable instead of mutating test.args["seed"]: avoids shared mutable state (see the sketch below)
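
For concreteness, a minimal sketch of the pattern. The harness shape is assumed here (a data list of tensor tuples, a custom_kernel entry point, CUDA-event timing); the names are illustrative, not the actual harness API.

```python
import torch

def timed_iteration(custom_kernel, data, base_seed, iteration):
    # Local seed derived per iteration; test.args is left untouched.
    seed = base_seed + iteration
    torch.manual_seed(seed)

    # Clone OUTSIDE the timed region: each iteration sees tensors with
    # fresh Python object identities, so an id()-keyed cache always misses,
    # while the clones themselves never count toward measured kernel time.
    fresh = [tuple(t.clone() for t in item) for item in data]

    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for item in fresh:
        custom_kernel(*item)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds
```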

The benchmark harness was vulnerable to submissions that cache results
based on Python object identity (e.g., id(tensor)). Since the same
data objects were reused across all timing iterations, a submission
could cache on first call and return cached results on subsequent
calls, showing artificial speedups of 12-36%.
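
A hypothetical submission along these lines would have defeated the old harness (real_kernel stands in for the actual work):

```python
# Hypothetical exploit, not an actual submission: memoize results keyed
# on the input tensor's Python object identity. Because the old harness
# reused the same tensor objects across timing iterations, every call
# after the first became a dictionary lookup.
_cache = {}

def custom_kernel(x):
    key = id(x)
    if key not in _cache:
        _cache[key] = real_kernel(x)  # stand-in for the expensive work, done once
    return _cache[key]
```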

Additional hardening on top of the object-identity caching fix (a combined sketch follows the list):

- Shuffle the data order each timing iteration to prevent call-count
  caching (a submission could track its invocation count and predict
  which data item appears at each position)
- Move the clone before torch.cuda.synchronize() so the clone's GPU
  copies can overlap with the previous iteration's tail work
- Fix a pre-existing recheck bug where only the last item's
  correctness was checked (the if not good sat outside the for loop)
- Use the shuffle_order indices to correctly pair shuffled outputs
  with their reference data during recheck
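
A combined sketch of the hardened loop, under the same assumed harness shape as above (shuffle_order, references, and rng are illustrative names):

```python
import random
import torch

def hardened_iteration(custom_kernel, data, references, rng, recheck=False):
    # Shuffle order each iteration: a submission that tracks its own
    # invocation count can no longer predict which item it will see.
    shuffle_order = list(range(len(data)))
    rng.shuffle(shuffle_order)

    # Clone BEFORE the synchronize, so the clone's device-side copies
    # can overlap with the tail of the previous iteration's work.
    fresh = [tuple(t.clone() for t in data[i]) for i in shuffle_order]
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    outputs = [custom_kernel(*item) for item in fresh]
    end.record()
    torch.cuda.synchronize()

    if recheck:
        # Check every item inside the loop (the old bug left the
        # `if not good` outside the for loop, so only the last item was
        # ever checked), pairing each output with its reference via
        # shuffle_order.
        for pos, idx in enumerate(shuffle_order):
            good = torch.allclose(outputs[pos], references[idx])
            if not good:
                raise AssertionError(f"incorrect result for item {idx}")

    return start.elapsed_time(end)
```

Here rng would be a random.Random instance seeded per run, keeping the shuffling reproducible without touching global state.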
nataliakokoromyti added a commit to nataliakokoromyti/reference-kernels that referenced this pull request Feb 22, 2026
…ess checks

The current eval times all 15 custom_kernel() calls as a single batch and
divides by 15. A malicious submission can exploit this by deferring all work
to one call (batching 15 problems into a single kernel launch) and making the
other 14 calls no-ops, reporting ~1/15th of the real per-call cost.

Cloning data alone (as proposed in gpu-mode#102) does not fully prevent this -- a
shape-matching fallback path can still collect new data objects and batch them.
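
Hypothetically, such a submission could look like this (real_kernel_batched stands in for the actual work; the count of 15 matches the eval described above). Note that cloning gives fresh object identities but does not stop it, since it simply collects whatever objects arrive:

```python
import torch

# Hypothetical amortization exploit: the first 14 calls return placeholder
# tensors and stash their inputs; the 15th call runs everything as one
# batched launch and backfills the earlier outputs in place. Batch timing
# divided by 15 then reports roughly 1/15th of the true per-call cost.
_pending = []

def custom_kernel(x):
    out = torch.empty_like(x)          # uncomputed placeholder
    _pending.append((x, out))
    if len(_pending) == 15:            # final call: do all the work at once
        xs = torch.stack([inp for inp, _ in _pending])
        ys = real_kernel_batched(xs)   # stand-in for one amortized launch
        for (_, out_i), y in zip(_pending, ys):
            out_i.copy_(y)
        _pending.clear()
    return out
```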

This fix (sketched after the list):
- Clones data each timing iteration (prevents object-identity caching)
- Times each call individually with its own CUDA events and GPU sync
  (prevents amortization across calls)
- Checks correctness after each individual call in recheck/leaderboard mode
  (catches deferred-computation exploits that return uncomputed tensors)
- Uses a local seed variable instead of mutating test.args
- Fixes the recheck indentation bug where only the last call was checked
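
A sketch of the per-call structure, under the same illustrative harness assumptions as earlier:

```python
import torch

def time_each_call(custom_kernel, data, references, seed, recheck=False):
    torch.manual_seed(seed)  # local seed; test.args is never mutated
    per_call_ms = []
    for item, ref in zip(data, references):
        fresh = tuple(t.clone() for t in item)  # fresh object identities
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = custom_kernel(*fresh)
        end.record()
        # This call's work must finish here, so it cannot be deferred
        # to a later call.
        torch.cuda.synchronize()
        per_call_ms.append(start.elapsed_time(end))
        if recheck and not torch.allclose(out, ref):
            # Immediate per-call check catches tensors returned uncomputed.
            raise AssertionError("incorrect result")
    return per_call_ms
```

Per-call synchronization adds some overhead versus batch timing, but it closes the amortization channel: no call can hand its work off to a later one.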