LLM-as-Judge
Score open-ended agent outputs against a written rubric using an LLM judge — automate quality evaluation where no exact-match metric applies.
Intent & Description
🎯 Intent
Use an LLM to score open-ended outputs against rubric criteria when no exact-match metric applies.
📋 Context
Your agent emits free-form text (summaries, generated code, long-form prose, support replies) where no single reference answer is uniquely correct. You want automated regression detection on every release or pull request, at a cadence that is not limited by how many outputs a human can grade in a week.
💡 Solution
Define a rubric. Prompt a judge model with the input, the candidate output, and the rubric. Have it return a structured score plus a rationale. Calibrate periodically against human-graded samples. Where possible, use a different model family for the judge than for the candidate, to reduce self-preference bias (a judge tending to favor outputs from its own model family).
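The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the rubric criteria, the 1 to 5 scale, the JSON reply shape, and the names `RUBRIC`, `Verdict`, `build_judge_prompt`, `parse_verdict`, and `judge` are all assumptions made for the example. The `call_model` argument stands in for whatever client sends a prompt to your judge model, and it is stubbed here so the sketch runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric: criterion name -> what the judge should look for.
RUBRIC = {
    "accuracy": "Claims are supported by the input; nothing is invented.",
    "completeness": "All key points of the input are covered.",
    "clarity": "The output is well organised and easy to read.",
}


@dataclass
class Verdict:
    scores: dict[str, int]  # one 1-5 score per rubric criterion
    rationale: str          # the judge's explanation, kept for debugging

    @property
    def overall(self) -> float:
        # Unweighted mean across criteria (an assumption; weights are possible).
        return sum(self.scores.values()) / len(self.scores)


def build_judge_prompt(task_input: str, candidate: str, rubric: dict[str, str]) -> str:
    """Combine input, candidate output, and rubric into one judge prompt."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are grading a candidate output against a rubric.\n"
        f"Rubric (score each criterion from 1 to 5):\n{criteria}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Candidate output:\n{candidate}\n\n"
        'Reply with JSON only: {"scores": {"<criterion>": <int>}, "rationale": "<text>"}'
    )


def parse_verdict(raw: str, rubric: dict[str, str]) -> Verdict:
    """Parse the judge's JSON reply and reject missing or out-of-range scores."""
    data = json.loads(raw)
    scores = {name: int(data["scores"][name]) for name in rubric}
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"score for {name!r} out of range: {value}")
    return Verdict(scores=scores, rationale=str(data["rationale"]))


def judge(task_input: str, candidate: str,
          call_model: Callable[[str], str],
          rubric: dict[str, str] = RUBRIC) -> Verdict:
    """Score one candidate output; call_model is your (hypothetical) judge client."""
    return parse_verdict(call_model(build_judge_prompt(task_input, candidate, rubric)), rubric)


def fake_judge_model(prompt: str) -> str:
    # Stub standing in for a real judge model call.
    return json.dumps({
        "scores": {"accuracy": 4, "completeness": 3, "clarity": 5},
        "rationale": "Faithful and readable, but omits one key point.",
    })


verdict = judge("Long article text...", "Short summary...", fake_judge_model)
print(verdict.overall, verdict.rationale)
```

Validating the reply in `parse_verdict` is a deliberate choice: a judge that returns malformed JSON or skips a criterion fails loudly instead of silently producing a misleading score.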
Real-world Use Case
- Open-ended outputs need automated regression detection without a reference answer.
- A rubric can be written that covers the qualities you actually care about.
- Calibration against human-graded samples is feasible periodically.
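One way to make the periodic calibration concrete is to compare judge scores with human scores on the same sample of outputs. The sketch below assumes both sides use the same integer scale. The function name `calibration_report` and the three statistics it reports are illustrative choices, and the sample scores are invented for the example, not real results.

```python
def calibration_report(judge_scores: list[int], human_scores: list[int]) -> dict[str, float]:
    """Summarise how closely judge scores track human scores on the same items."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        # Share of items where judge and human gave the same score.
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        # Share of items where the scores differ by at most one point.
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # Positive means the judge is more lenient than the humans on average.
        "mean_bias": sum(diffs) / n,
    }


# Made-up sample: five outputs scored by both the judge and a human grader.
report = calibration_report(
    judge_scores=[4, 3, 5, 2, 4],
    human_scores=[4, 4, 5, 1, 3],
)
print(report)
```

If agreement drops or the bias drifts between calibration rounds, that is a signal to revisit the rubric wording or the judge prompt before trusting new scores.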
📌 TL;DR
Use a rubric-prompted judge model to score free-form outputs automatically — get regression detection on every commit without waiting for a human review queue.
Advantages
- Scales free-form evaluation to every PR and release without a human review queue.
- Judge rationales are debugging breadcrumbs — not just a score, but a reason.
Disadvantages
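- Judge biases (for example position bias, a preference for longer answers, and a preference for the judge's own model family) skew scores in subtle, hard-to-detect ways.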
- Cost: every eval run now adds N judge model calls, one for each of the N outputs being scored.