Fix batch-and-skip benchmark exploit via per-call timing#104
Fix batch-and-skip benchmark exploit via per-call timing#104nataliakokoromyti wants to merge 3 commits intogpu-mode:mainfrom
Conversation
…ess checks The current eval times all 15 custom_kernel() calls as a single batch and divides by 15. A malicious submission can exploit this by deferring all work to one call (batching 15 problems into a single kernel launch) and making the other 14 calls no-ops, reporting ~1/15th of the real per-call cost. Cloning data alone (as proposed in gpu-mode#102) does not fully prevent this -- a shape-matching fallback path can still collect new data objects and batch them. This fix: - Clones data each timing iteration (prevents object-identity caching) - Times each call individually with its own CUDA events and GPU sync (prevents amortization across calls) - Checks correctness after each individual call in recheck/leaderboard mode (catches deferred-computation exploits that return uncomputed tensors) - Uses a local seed variable instead of mutating test.args - Fixes the recheck indentation bug where only the last call was checked
|
Hey @nataliakokoromyti — this is awesome, thanks for writing it up so clearly. The explanation of why #102’s clone+shuffle isn’t enough (shape-match + pointer-update path) is exactly right. One thing I noticed when I ran One nuance on semantics: the per-call correctness checks guarantee “correct when checked,” but they don’t fully enforce call independence as a contract. A clever submission can still coordinate across calls (batching/deferral) as long as it lands the writes before the check. So it’s a strong hammer, but it’s costly and still not quite the clean independence guarantee we want. A direction that seems both cheaper and more targeted is output fingerprint auditing (we’ve been experimenting with this and it’s been working well):
This hits the exploit mechanism directly (temporal integrity) instead of inferring cheating from timing skew. In our tests a fingerprint-based audit ( A couple notes / caveats on fingerprinting (worth us digging into together):
I also tried a couple follow-ups building on your approach:
Diff is here (easy to cherry-pick pieces): https://gist.github.com/G-structure/f9de3df9b051f43c06422ffd7a21a8dd Down to pair on integrating this in a way that keeps leaderboard scoring “real,” with the stronger checks happening only on integrity repeats. |
|
Hi, I think it does avoid most of the problems mentioned above, and it tries to minimize the overhead of the benchmarking framework by implementing the main loop in C++, calling the user function through nanobind. Note that the checking kernel is started using PDL to minimize the attack window, and checks the entries in the result in randomized order. |
208fd03 to
11fe446
Compare
11fe446 to
340e48e
Compare
|
thanks @G-structure and @ngc92 for your help and thoughtful responses. idk what the timeline for migrating to cpp is (great idea) but till then sth like this pr ^ could be beneficial. |
Summary
Fixes a benchmark exploit in
eval_better_bench_grouped_gemm.pywhere a submission can batch all 15custom_kernel()calls into a single GPU kernel launch and make 14/15 timed calls into no-ops (pure dict lookups returning cached results). This reports ~1/15th of the real per-call cost.Why #102's fix is insufficient: The clone+shuffle approach in #102 breaks trivial
id()-based caching, but a more sophisticated exploit uses a shape-matching fallback path that collects cloned data objects by problem shape and still batches them — the pointer-update path doesn't depend on stableid()values at all.Changes
custom_kernel()call is individually timed withtorch.cuda.synchronize()between calls, preventing work deferral across callstest.args["seed"]across iterationsHow the exploit works
The exploit:
id(), tensors, and results_build_superbatch(): Merges all 15 × 8 groups = 120 groups into a single kernel launchid()triggers the batched kernel; the other 14 return pre-cached results (zero GPU work)id()values change (e.g., after cloning), collects all 15 new objects by shape match, updates pointer tables, and still launches only once — defeating clone-based mitigationsWhy this fix works
Test plan