
Fix benchmark exploit via object-identity caching #102

Open

msaroufim wants to merge 2 commits into main from fix-benchmark-object-identity-exploit

Conversation

@msaroufim (Member) commented Feb 7, 2026

Summary

  • Fixes a vulnerability where submissions could cache results based on Python object identity (id(tensor))

Changes

  1. Clone data before each timing iteration (outside the timed region): gives each iteration fresh object identities without affecting measured kernel time
  2. Use a local seed variable instead of mutating test.args["seed"]: avoids shared mutable state (see the sketch below)
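
For concreteness, a minimal sketch of the pattern. The harness shape is assumed here (a data list of tensor tuples, a custom_kernel entry point, CUDA-event timing); the names are illustrative, not the actual harness API.

```python
import torch

def timed_iteration(custom_kernel, data, base_seed, iteration):
    # Local seed derived per iteration; test.args is left untouched.
    seed = base_seed + iteration
    torch.manual_seed(seed)

    # Clone OUTSIDE the timed region: each iteration sees tensors with
    # fresh Python object identities, so an id()-keyed cache always misses,
    # while the clones themselves never count toward measured kernel time.
    fresh = [tuple(t.clone() for t in item) for item in data]

    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for item in fresh:
        custom_kernel(*item)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds
```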

The benchmark harness was vulnerable to submissions that cache results
based on Python object identity (e.g., id(tensor)). Since the same
data objects were reused across all timing iterations, a submission
could cache on first call and return cached results on subsequent
calls, showing artificial speedups of 12-36%.
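
A hypothetical submission along these lines would have defeated the old harness (real_kernel stands in for the actual work):

```python
# Hypothetical exploit, not an actual submission: memoize results keyed
# on the input tensor's Python object identity. Because the old harness
# reused the same tensor objects across timing iterations, every call
# after the first became a dictionary lookup.
_cache = {}

def custom_kernel(x):
    key = id(x)
    if key not in _cache:
        _cache[key] = real_kernel(x)  # stand-in for the expensive work, done once
    return _cache[key]
```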

Additional hardening on top of the object-identity caching fix (a combined sketch follows the list):

- Shuffle the data order each timing iteration to prevent call-count
  caching (a submission could track its invocation count and predict
  which data item appears at each position)
- Move the clone before torch.cuda.synchronize() so the clone's GPU
  copies can overlap with the previous iteration's tail work
- Fix a pre-existing recheck bug where only the last item's
  correctness was checked (the if not good sat outside the for loop)
- Use the shuffle_order indices to correctly pair shuffled outputs
  with their reference data during recheck
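
A combined sketch of the hardened loop, under the same assumed harness shape as above (shuffle_order, references, and rng are illustrative names):

```python
import random
import torch

def hardened_iteration(custom_kernel, data, references, rng, recheck=False):
    # Shuffle order each iteration: a submission that tracks its own
    # invocation count can no longer predict which item it will see.
    shuffle_order = list(range(len(data)))
    rng.shuffle(shuffle_order)

    # Clone BEFORE the synchronize, so the clone's device-side copies
    # can overlap with the tail of the previous iteration's work.
    fresh = [tuple(t.clone() for t in data[i]) for i in shuffle_order]
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    outputs = [custom_kernel(*item) for item in fresh]
    end.record()
    torch.cuda.synchronize()

    if recheck:
        # Check every item inside the loop (the old bug left the
        # `if not good` outside the for loop, so only the last item was
        # ever checked), pairing each output with its reference via
        # shuffle_order.
        for pos, idx in enumerate(shuffle_order):
            good = torch.allclose(outputs[pos], references[idx])
            if not good:
                raise AssertionError(f"incorrect result for item {idx}")

    return start.elapsed_time(end)
```

Here rng would be a random.Random instance seeded per run, keeping the shuffling reproducible without touching global state.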
nataliakokoromyti added a commit to nataliakokoromyti/reference-kernels that referenced this pull request Feb 22, 2026
…ess checks

The current eval times all 15 custom_kernel() calls as a single batch and
divides by 15. A malicious submission can exploit this by deferring all work
to one call (batching 15 problems into a single kernel launch) and making the
other 14 calls no-ops, reporting ~1/15th of the real per-call cost.

Cloning data alone (as proposed in gpu-mode#102) does not fully prevent this -- a
shape-matching fallback path can still collect new data objects and batch them.
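
Hypothetically, such a submission could look like this (real_kernel_batched stands in for the actual work; the count of 15 matches the eval described above). Note that cloning gives fresh object identities but does not stop it, since it simply collects whatever objects arrive:

```python
import torch

# Hypothetical amortization exploit: the first 14 calls return placeholder
# tensors and stash their inputs; the 15th call runs everything as one
# batched launch and backfills the earlier outputs in place. Batch timing
# divided by 15 then reports roughly 1/15th of the true per-call cost.
_pending = []

def custom_kernel(x):
    out = torch.empty_like(x)          # uncomputed placeholder
    _pending.append((x, out))
    if len(_pending) == 15:            # final call: do all the work at once
        xs = torch.stack([inp for inp, _ in _pending])
        ys = real_kernel_batched(xs)   # stand-in for one amortized launch
        for (_, out_i), y in zip(_pending, ys):
            out_i.copy_(y)
        _pending.clear()
    return out
```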

This fix (sketched after the list):
- Clones data each timing iteration (prevents object-identity caching)
- Times each call individually with its own CUDA events and GPU sync
  (prevents amortization across calls)
- Checks correctness after each individual call in recheck/leaderboard mode
  (catches deferred-computation exploits that return uncomputed tensors)
- Uses a local seed variable instead of mutating test.args
- Fixes the recheck indentation bug where only the last call was checked
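
A sketch of the per-call structure, under the same illustrative harness assumptions as earlier:

```python
import torch

def time_each_call(custom_kernel, data, references, seed, recheck=False):
    torch.manual_seed(seed)  # local seed; test.args is never mutated
    per_call_ms = []
    for item, ref in zip(data, references):
        fresh = tuple(t.clone() for t in item)  # fresh object identities
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = custom_kernel(*fresh)
        end.record()
        # This call's work must finish here, so it cannot be deferred
        # to a later call.
        torch.cuda.synchronize()
        per_call_ms.append(start.elapsed_time(end))
        if recheck and not torch.allclose(out, ref):
            # Immediate per-call check catches tensors returned uncomputed.
            raise AssertionError("incorrect result")
    return per_call_ms
```

Per-call synchronization adds some overhead versus batch timing, but it closes the amortization channel: no call can hand its work off to a later one.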