Agent-as-a-Judge
Use a second agent to evaluate the full execution trajectory — every step, tool call, and intermediate state — not just the final answer.
Intent & Description
🎯 Intent
Evaluate an agent’s full trajectory — steps, tool calls, intermediate states — rather than scoring only the final output.
📋 Context
For multi-step tasks (fixing a real bug, chaining tool calls to answer a question), the final answer alone is a poor quality signal. An agent can arrive at a right answer through a terrible, inefficient, or unsafe path. You need trajectory-level evals.
💡 Solution
A judge agent receives the candidate agent’s full trajectory: thoughts, tool calls, observations, intermediate state, and final answer. It evaluates the trajectory against a rubric covering correctness, efficiency, and process quality, then outputs a structured verdict with a rationale. Use a different model family for the judge than for the candidate to reduce self-preference bias, where a judge rates outputs resembling its own more favorably.
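The flow described above can be sketched in a few lines. This is a minimal illustration and not a reference implementation: the names `Step`, `Trajectory`, `Verdict`, `build_judge_prompt`, `parse_verdict`, and `judge_trajectory` are hypothetical, the 1 to 5 scale and the JSON reply shape are assumptions, and `call_judge_model` stands in for whatever client reaches the judge model.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed rubric criteria, taken from the three dimensions named in the text.
RUBRIC = ("correctness", "efficiency", "process_quality")


@dataclass
class Step:
    thought: str
    tool: Optional[str]  # None when the step made no tool call
    tool_input: str
    observation: str


@dataclass
class Trajectory:
    task: str
    steps: list
    final_answer: str


@dataclass
class Verdict:
    scores: dict
    passed: bool
    rationale: str


def build_judge_prompt(traj: Trajectory) -> str:
    """Serialize the whole trajectory, not just the final answer."""
    lines = [f"Task: {traj.task}", "Trajectory:"]
    for i, step in enumerate(traj.steps, start=1):
        lines.append(f"Step {i} thought: {step.thought}")
        if step.tool is not None:
            lines.append(f"Step {i} tool call: {step.tool}({step.tool_input})")
        lines.append(f"Step {i} observation: {step.observation}")
    lines.append(f"Final answer: {traj.final_answer}")
    lines.append(
        "Score each criterion from 1 to 5: " + ", ".join(RUBRIC) + ". "
        'Reply with JSON only: {"scores": {...}, '
        '"verdict": "pass" or "fail", "rationale": "..."}'
    )
    return "\n".join(lines)


def parse_verdict(raw: str) -> Verdict:
    """Validate the judge reply so malformed verdicts fail loudly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = data.get("scores")
    if not isinstance(scores, dict) or set(scores) != set(RUBRIC):
        raise ValueError("scores must cover exactly the rubric criteria")
    if not all(type(v) is int and 1 <= v <= 5 for v in scores.values()):
        raise ValueError("each score must be an integer from 1 to 5")
    if data.get("verdict") not in ("pass", "fail"):
        raise ValueError('verdict must be "pass" or "fail"')
    rationale = data.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("a non-empty rationale is required")
    return Verdict(scores, data["verdict"] == "pass", rationale)


def judge_trajectory(
    traj: Trajectory, call_judge_model: Callable[[str], str]
) -> Verdict:
    # call_judge_model is injected so the judge can be a different
    # model family from the candidate, as the pattern recommends.
    return parse_verdict(call_judge_model(build_judge_prompt(traj)))
```

Passing the model call in as a function keeps the sketch independent of any provider and makes it easy to substitute a stub while testing the prompt and parser.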
Real-world Use Case
- Agent tasks can succeed or fail along the trajectory in ways the final answer cannot reveal.
- You have access to the full trajectory (thoughts, tool calls, observations) of the candidate agent.
- Process-quality signals — efficiency, redundant steps, unsafe actions — matter for the verdict, not just correctness.
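Some of the process-quality signals listed above can be computed deterministically and handed to the judge as evidence. The sketch below is one possible way to do that, under stated assumptions: `process_signals` and `UNSAFE_MARKERS` are hypothetical names, tool calls are represented as `(tool, argument)` pairs, and the marker list is a toy example, not a real safety policy.

```python
from collections import Counter

# Illustrative substrings only; a real deployment would need its own policy.
UNSAFE_MARKERS = ("rm -rf", "drop table", "git push --force")


def process_signals(tool_calls, unsafe_markers=UNSAFE_MARKERS):
    """Summarize redundancy and unsafe actions in a list of (tool, arg) pairs."""
    counts = Counter(tool_calls)
    # Every repeat of an identical call beyond the first counts as redundant.
    redundant = sum(n - 1 for n in counts.values() if n > 1)
    unsafe = [
        (tool, arg)
        for tool, arg in tool_calls
        if any(marker.lower() in arg.lower() for marker in unsafe_markers)
    ]
    return {
        "total_calls": len(tool_calls),
        "redundant_calls": redundant,
        "unsafe_calls": unsafe,
    }
```

Counts like these do not replace the judge's rubric assessment; they give it concrete facts to cite in its rationale.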
📌 TL;DR
Agent-as-a-Judge grades the journey, not just the destination — essential when a wrong path can cause real damage even if the agent lands on the right answer.
Advantages
- Catches process-level failures hiding behind correct answers.
- Produces inspectable judge rationales — not just a score, but a why.
Disadvantages
- Expensive: every evaluated run needs a full judge model call, and the judge must read the entire trajectory, so cost grows with trajectory length.
- Calibrating the judge on trajectory rubrics requires its own effort: a dataset of trajectories labeled with human verdicts.
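One common way to check calibration, once human-labeled trajectories exist, is to measure how often the judge agrees with the human labels beyond chance. The sketch below computes Cohen's kappa for that purpose. It assumes verdicts are categorical labels such as "pass" and "fail", and the function name `cohens_kappa` is chosen here for illustration.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between judge and human verdicts."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    # Fraction of trajectories where judge and human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected if both labeled independently at their own rates.
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates the judge tracks human verdicts closely, while a value near 0 means its agreement is no better than chance and the rubric or prompt likely needs revision.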