Benchmark Contamination Detection | designpattern.fyi

Benchmark Contamination Detection

Verify that benchmark test sets do not appear in training data. Scores inflated by contamination measure memorization, not capability.

Intent & Description

🎯 Intent

Catch training data contamination that would invalidate benchmark scores. High scores should reflect genuine generalization, not test answers memorized during pre-training.

📋 Context

Web-crawled pre-training corpora often contain benchmark data. A model that saw MMLU questions during pre-training can score higher on MMLU because it memorized the answers, not because it is more capable. Benchmark results published without contamination analysis are unreliable as capability measurements.

💡 Solution

For each benchmark, compute n-gram overlap between each test example and the training corpus. Flag an example as contaminated when the share of its n-grams found in the corpus exceeds a chosen threshold. Report scores separately for the clean and contaminated subsets, or exclude contaminated examples entirely (a contamination-filtered benchmark). When the training corpus is not available, use min-k% probability probing: score each test string with the model, take the k% of tokens to which the model assigned the lowest probability, and average their log-probabilities. An unusually high average suggests the model has seen the string before, which gives a memorization signal without corpus inspection.
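The overlap check and the split report described above can be sketched as follows. This is a minimal illustration, not a production pipeline: the function names, whitespace tokenization, `n=8`, and `threshold=0.5` are all assumptions chosen for the example, and an in-memory set only works for small corpora.

```python
from typing import Iterable, Optional, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Lowercased whitespace-token n-grams of `text` (tokenization is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram in the training corpus into one set."""
    index: Set[NGram] = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index


def overlap_ratio(example: str, index: Set[NGram], n: int) -> float:
    """Fraction of the example's n-grams that also occur in the corpus index."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0  # example shorter than n tokens: nothing to match
    return len(grams & index) / len(grams)


def split_accuracy(
    examples: Sequence[str],
    correct: Sequence[bool],
    index: Set[NGram],
    n: int = 8,             # illustrative value, not a standard
    threshold: float = 0.5,  # illustrative value, not a standard
) -> Tuple[Optional[float], Optional[float]]:
    """Return (clean_accuracy, contaminated_accuracy); None for an empty subset."""
    clean, contaminated = [], []
    for text, ok in zip(examples, correct):
        flagged = overlap_ratio(text, index, n) >= threshold
        (contaminated if flagged else clean).append(ok)

    def accuracy(results):
        return sum(results) / len(results) if results else None

    return accuracy(clean), accuracy(contaminated)
```

A large gap between the two accuracies is the signal worth reporting. At real corpus scale the set would likely be replaced by hashed n-grams, a Bloom filter, or a suffix array.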

Real-world Use Case

Validating benchmark scores before publishing model capability claims. Evaluating models trained on web-scraped corpora against popular public benchmarks. Compliance with responsible reporting standards in model papers and release docs.

📌 TL;DR

High scores on contaminated test sets measure memorization, not capability. Check n-gram overlap before trusting or publishing benchmark numbers.

Advantages

Makes benchmark scores meaningful rather than inflated by memorization
Contamination-filtered results are more comparable across models whose training corpora differ in composition
Identifies which benchmark results reflect genuine generalization vs. training data overlap

Disadvantages

Requires training corpus access for n-gram overlap analysis, which is usually not available for third-party models
N-gram overlap misses paraphrased or lightly edited benchmark content
No universally accepted contamination threshold; field practices vary widely
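Where the corpus is out of reach, the min-k% probing mentioned in the Solution can be sketched like this. The function name and the default `k` are assumptions for illustration, and the per-token log-probabilities are assumed to come from whatever scoring interface the model exposes (not shown here).

```python
import math
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of tokens the model found least likely.

    `token_logprobs` holds one log-probability per token of the test string,
    obtained by scoring the string with the model under evaluation.
    Values closer to 0 suggest the string may have been seen in training.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    count = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:count]  # the least likely tokens
    return sum(lowest) / count
```

The score is only meaningful relative to a reference: for example, compare benchmark strings against text known to postdate the model's training cutoff. Any decision threshold has to be calibrated per model, which is why none is hard-coded here.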


© 2026 designpattern.fyi. Crafted with ❤️ for modern software engineers by OpenAGI