Dimensional Synthetic Eval Set
Generate eval inputs by enumerating tuples over named dimensions (persona × scenario × modality), not by free-form LLM prompting that mode-collapses to a few archetypes.
Intent & Description
🎯 Intent
Make coverage gaps in your eval set visible and auditable, not hidden behind volume.
📋 Context
You asked an LLM to ‘generate 200 eval prompts for this feature’ and got 200 prompts that all look suspiciously similar — covering three archetypes out of 30. Your eval set looks large but covers a sliver of the actual input space.
💡 Solution
Explicitly name the dimensions of your input space: persona (new user / power user / staff), feature variant, scenario (success / failure / ambiguous), modality (text / voice / image). Take the cross-product of the dimension values to enumerate every tuple; if the grid is too large to generate in full, sample from it. For each tuple, ask the LLM to generate eval inputs grounded in those specifics. Coverage gaps are now visible: the tuple grid shows which combinations have no inputs.
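The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the persona, scenario, and modality values come from the text, while the `feature_variant` values, the sampling budget, the prompt wording, and all function names are assumptions made for the example.

```python
import itertools
import random

# persona, scenario, and modality values are taken from the text above.
# The feature_variant values are hypothetical placeholders.
DIMENSIONS = {
    "persona": ["new user", "power user", "staff"],
    "feature_variant": ["variant_a", "variant_b"],  # hypothetical
    "scenario": ["success", "failure", "ambiguous"],
    "modality": ["text", "voice", "image"],
}


def enumerate_tuples(dimensions):
    """Return every combination of dimension values as a dict."""
    names = list(dimensions)
    return [
        dict(zip(names, values))
        for values in itertools.product(*dimensions.values())
    ]


def sample_tuples(tuples, budget, seed=0):
    """Keep the full grid if it fits the budget, else sample without replacement."""
    if len(tuples) <= budget:
        return list(tuples)
    return random.Random(seed).sample(tuples, budget)


def build_prompt(tup, n_inputs=3):
    """Build a generation prompt grounded in one tuple (wording is illustrative)."""
    specifics = "; ".join(f"{name}: {value}" for name, value in tup.items())
    return (
        f"Write {n_inputs} realistic eval inputs for this feature. "
        f"Ground each one in these specifics: {specifics}."
    )


all_tuples = enumerate_tuples(DIMENSIONS)
selected = sample_tuples(all_tuples, budget=20)
prompts = [build_prompt(t) for t in selected]
print(f"{len(all_tuples)} tuples in the grid, {len(selected)} selected")
```

Each prompt would then be sent to whatever LLM client the team already uses. Storing the originating tuple alongside every generated input is what keeps the grid auditable later.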
Real-world Use Case
- Eval set is being expanded and coverage actually matters.
- Input space has natural dimensions the team can name.
- Mode-collapse in free-form generation has been observed or is suspected.
📌 TL;DR
Don’t ask an LLM to ‘generate 200 evals.’ Name your dimensions, enumerate tuples, seed generation from each. Coverage gaps become visible. Mode-collapse can’t hide.
Advantages
- Coverage is auditable as a tuple grid — no vibe-checking required.
- Mode-collapse shows up as empty or thin cells on a named dimension instead of hiding behind volume.
- Adding a new dimension is an explicit decision, visible to everyone.
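One way to make the tuple grid auditable is to count generated inputs per cell and list the empty ones. The sketch below assumes each eval item was stored with the dimension values it was generated from; the item records, the two-dimension grid, and the function names are hypothetical.

```python
import itertools
from collections import Counter

# A two-dimension grid keeps the example small; values come from the text above.
DIMENSIONS = {
    "persona": ["new user", "power user", "staff"],
    "scenario": ["success", "failure", "ambiguous"],
}

# Hypothetical eval items, each tagged with the tuple it was generated from.
eval_items = [
    {"persona": "new user", "scenario": "success", "input": "placeholder input 1"},
    {"persona": "new user", "scenario": "success", "input": "placeholder input 2"},
    {"persona": "power user", "scenario": "failure", "input": "placeholder input 3"},
    {"persona": "staff", "scenario": "ambiguous", "input": "placeholder input 4"},
]


def coverage_grid(dimensions, items):
    """Map every cell of the grid to the number of eval items that fall in it."""
    names = list(dimensions)
    counts = Counter(tuple(item[name] for name in names) for item in items)
    return {
        cell: counts.get(cell, 0)
        for cell in itertools.product(*dimensions.values())
    }


def empty_cells(grid):
    """Return the cells that have no eval items at all."""
    return [cell for cell, count in grid.items() if count == 0]


grid = coverage_grid(DIMENSIONS, eval_items)
missing = empty_cells(grid)
for cell, count in grid.items():
    print(cell, count)
print(f"{len(missing)} of {len(grid)} cells are empty")
```

A report like this can run in CI or before an eval review, so that a gap is a listed cell rather than an impression.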
Disadvantages
- Tuple cardinality grows multiplicatively: three dimensions of three values each give 27 tuples, and every added dimension multiplies that count again.
- Some tuple combinations are nonsensical and waste generation budget.
- Dimensions must capture meaningful variance — arbitrary axes produce meaningless coverage.
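The first two costs can be checked before spending any generation budget: compute the grid size, then filter out combinations the team has marked as nonsensical. The sketch below is illustrative only; the exclusion rule (new users without image input) is an invented assumption, as are the function names.

```python
import itertools
import math

# Dimension values are taken from the Solution section.
DIMENSIONS = {
    "persona": ["new user", "power user", "staff"],
    "scenario": ["success", "failure", "ambiguous"],
    "modality": ["text", "voice", "image"],
}

# Hypothetical exclusion rules. Each rule is a partial assignment; any tuple
# matching every key in a rule is treated as nonsensical and skipped.
EXCLUDED = [
    {"persona": "new user", "modality": "image"},  # assumption for illustration
]


def cardinality(dimensions):
    """Size of the full grid: the product of the dimension sizes."""
    return math.prod(len(values) for values in dimensions.values())


def is_valid(tup, excluded=EXCLUDED):
    """True unless the tuple matches every key of some exclusion rule."""
    return not any(
        all(tup[name] == value for name, value in rule.items())
        for rule in excluded
    )


names = list(DIMENSIONS)
valid_tuples = [
    tup
    for tup in (
        dict(zip(names, values))
        for values in itertools.product(*DIMENSIONS.values())
    )
    if is_valid(tup)
]

print(f"full grid: {cardinality(DIMENSIONS)}, after exclusions: {len(valid_tuples)}")
```

Keeping the exclusion rules as data rather than burying them in generation code means a skipped combination is a recorded decision and not an accidental gap.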