Benchmark Contamination Detection
Verify that benchmark test sets don't appear in training data — inflated scores from contamination measure memorization, not capability.
Intent & Description
🎯 Intent
Catch training data contamination that would invalidate benchmark scores — high scores should reflect genuine generalization, not memorized test answers seen during pre-training.
📋 Context
Web-crawled pre-training corpora often contain benchmark data. A model that saw MMLU questions during pre-training can score higher on MMLU because it memorized answers, not because it is more capable. Benchmark results published without contamination analysis cannot be trusted as capability measurements.
💡 Solution
For each benchmark, compute n-gram overlap between test set strings and the training corpus. Flag an example as contaminated when the share of its n-grams found in the corpus is above a chosen threshold. Report scores separately for the clean and contaminated subsets, or exclude contaminated examples entirely (a contamination-filtered benchmark). For models whose training corpus is not accessible, use Min-K% probability probing: score each test string by the average log-probability the model assigns to its k% least likely tokens. An unusually high score is a memorization signal that requires no corpus inspection.
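The n-gram overlap check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the whitespace tokenization, the n-gram size of 8, and the 0.5 threshold are all assumptions chosen for readability, and the in-memory set only suits small corpora (a production pipeline would likely need a Bloom filter or a suffix array).

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercased, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(documents: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs anywhere in the training documents."""
    index: Set[NGram] = set()
    for doc in documents:
        index |= ngrams(doc, n)
    return index


def overlap_ratio(example: str, corpus_index: Set[NGram], n: int) -> float:
    """Fraction of the example's n-grams that also occur in the corpus."""
    grams = ngrams(example, n)
    if not grams:
        # Example is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(grams & corpus_index) / len(grams)


def flag_contaminated(
    test_examples: List[str],
    corpus_index: Set[NGram],
    n: int = 8,              # assumed n-gram size, not a standard value
    threshold: float = 0.5,  # assumed cutoff, tune per benchmark
) -> List[bool]:
    """Mark each test example whose overlap ratio is above the threshold."""
    return [overlap_ratio(ex, corpus_index, n) > threshold for ex in test_examples]
```

The corpus index must be built with the same `n` that is later passed to `flag_contaminated`. Exact n-gram matching of this kind will not catch paraphrased benchmark content.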
Real-world Use Case
📌 TL;DR
High scores on contaminated test sets measure memorization, not capability. Check n-gram overlap before trusting or publishing benchmark numbers.
Advantages
- Makes benchmark scores meaningful rather than inflated by memorization
- Contamination-filtered results are more comparable across models trained on different corpora
- Identifies which benchmark results reflect genuine generalization vs. training data overlap
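Separate reporting for clean and contaminated subsets might look like the sketch below. It assumes per-example correctness and contamination flags are already available as parallel lists; the function and key names are hypothetical.

```python
from typing import Dict, List, Optional


def accuracy(correct: List[bool]) -> Optional[float]:
    """Mean accuracy, or None when the subset is empty."""
    return sum(correct) / len(correct) if correct else None


def split_report(
    correct: List[bool], contaminated: List[bool]
) -> Dict[str, Optional[float]]:
    """Report accuracy overall and on the clean and contaminated subsets."""
    if len(correct) != len(contaminated):
        raise ValueError("correct and contaminated must have the same length")
    clean = [c for c, flag in zip(correct, contaminated) if not flag]
    dirty = [c for c, flag in zip(correct, contaminated) if flag]
    return {
        "overall": accuracy(correct),
        "clean": accuracy(clean),          # the contamination-filtered score
        "contaminated": accuracy(dirty),
        "contaminated_fraction": len(dirty) / len(correct) if correct else None,
    }
```

A large gap between the `clean` and `contaminated` values suggests that memorization is inflating the overall score. Reporting `contaminated_fraction` alongside the scores shows readers how much of the benchmark was affected.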
Disadvantages
- N-gram overlap analysis requires training corpus access, which is usually unavailable for third-party models
- N-gram overlap misses paraphrased or lightly edited benchmark content
- No universally accepted contamination threshold exists, and practices vary widely across the field
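When the training corpus is unavailable, the Min-K% probing mentioned in the Solution can be approximated as below. The sketch assumes the per-token log-probabilities of a test string have already been obtained from the model (for example from its logits). The function name and the default `k` of 0.2 are illustrative assumptions, and any decision threshold on the score would need calibration against text known to be unseen.

```python
import math
from typing import List


def min_k_percent_score(token_logprobs: List[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of least likely tokens.

    Higher (less negative) scores mean even the most surprising tokens were
    predicted confidently, which is treated as a memorization signal.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0 < k <= 1:
        raise ValueError("k must be in the interval (0, 1]")
    # Always keep at least one token so short strings still get a score.
    count = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:count]
    return sum(lowest) / count
```

Scores are only meaningful in comparison, for instance between benchmark strings and freshly written text of similar style scored by the same model.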