Blind Grader with Isolated Context
Run the evaluator in a fresh context window with only the artifact and the rubric — never the producer's reasoning chain — so the grader can't inherit the same blind spots.
Intent & Description
🎯 Intent
Catch failures that same-context critique systematically misses by grading blind.
📋 Context
Your producer agent runs a long reasoning chain and builds an artifact. The downstream evaluator gets handed the producer’s full trace alongside the artifact — and predictably agrees with it, inheriting the same assumptions and missing the same errors.
💡 Solution
When the producer finishes, allocate a fresh context window. Construct a grader call containing only the artifact and the rubric. Deliberately exclude the producer’s reasoning chain, scratchpad, and prior turns. The grader judges on its own terms. Log the verdict against both the artifact and the producer’s trace for audit — but the grader was blind at decision time. Same model works fine; context isolation is the load-bearing element.
Real-world Use Case
- Producer self-critique has a known echo-chamber failure mode on this task.
- A rubric can be written that doesn’t require the producer’s reasoning to apply.
- The artifact is self-contained enough to grade on its own.
Source
📌 TL;DR
Fresh context. Artifact + rubric only. No producer trace. The grader can’t inherit blind spots it never saw. Context isolation does the work, not a different model.
Advantages
- Catches a class of failures that same-context critique systematically misses.
- Works with the same model — no second-vendor cost or routing complexity.
- Rubric becomes a first-class artifact since the grader has nothing else to lean on.
- Clean audit story: producer trace and grader verdict are independently attributable.
Disadvantages
- Grader can’t use legitimate context from the producer’s reasoning — rubric must carry it explicitly.
- Rubric authoring becomes the bottleneck; a vague rubric in isolation is worse than a tight rubric with trace.
- Extra context allocation costs tokens and latency per check.
- Discipline required: even a summary of the producer’s trace in the grader’s context defeats the pattern.