Scorer Live Monitoring
Score agent outputs asynchronously after they reach the user — multiple scorer types running in parallel, zero latency impact, low-score events routed to a review queue.
Intent & Description
🎯 Intent
Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log — but do not regenerate the output.
📋 Context
You’re running an agent on real user traffic with a tight latency budget, but you need a continuous quality signal — not just a snapshot at release time. Multiple quality dimensions matter simultaneously: helpfulness (LLM judge), forbidden phrases (programmatic), reference similarity (embedding), rubric compliance. You can’t block the user path for all of these.
💡 Solution
After the agent returns to the user, publish {request_id, input, output, context} to a scoring stream. Independent scorer workers consume the stream and emit {request_id, scorer, score, evidence} records. Aggregate these records into dashboards and alert rules, and route low scores to a review queue instead of triggering regeneration in the user’s request path.
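The flow above can be sketched in-process with `asyncio`. This is a minimal illustration, not a production design: an `asyncio.Queue` stands in for the scoring stream, and the scorer functions, the phrase list, and `LOW_SCORE_THRESHOLD` are all hypothetical names chosen for the example. A real deployment would more likely use a durable stream with one consumer group per scorer.

```python
import asyncio
import contextlib
from dataclasses import dataclass, field


@dataclass
class ScoringEvent:
    request_id: str
    input: str
    output: str
    context: dict = field(default_factory=dict)


@dataclass
class ScoreRecord:
    request_id: str
    scorer: str
    score: float
    evidence: str


FORBIDDEN_PHRASES = ("guaranteed returns",)  # hypothetical phrase list
LOW_SCORE_THRESHOLD = 0.5                    # hypothetical review cutoff


async def forbidden_phrase_scorer(event: ScoringEvent) -> ScoreRecord:
    # Programmatic check: score 0.0 if any forbidden phrase appears.
    hits = [p for p in FORBIDDEN_PHRASES if p in event.output.lower()]
    return ScoreRecord(event.request_id, "forbidden_phrases",
                       0.0 if hits else 1.0, f"hits={hits}")


async def min_length_scorer(event: ScoringEvent) -> ScoreRecord:
    await asyncio.sleep(0)  # stand-in for a slower call such as an LLM judge
    words = len(event.output.split())
    return ScoreRecord(event.request_id, "min_length",
                       1.0 if words >= 3 else 0.0, f"words={words}")


SCORERS = (forbidden_phrase_scorer, min_length_scorer)


async def scorer_worker(stream: asyncio.Queue, records: list, review_queue: list) -> None:
    # Consumes events off the user path; never regenerates the output.
    while True:
        event = await stream.get()
        try:
            results = await asyncio.gather(*(scorer(event) for scorer in SCORERS))
            for record in results:
                records.append(record)
                if record.score < LOW_SCORE_THRESHOLD:
                    review_queue.append(record)
        finally:
            stream.task_done()


def publish_after_response(stream: asyncio.Queue, request_id: str,
                           user_input: str, output: str) -> None:
    # put_nowait on an unbounded queue returns immediately.
    stream.put_nowait(ScoringEvent(request_id, user_input, output))


async def run_demo():
    stream: asyncio.Queue = asyncio.Queue()
    records, review_queue = [], []
    worker = asyncio.create_task(scorer_worker(stream, records, review_queue))
    publish_after_response(stream, "r1", "What are index funds?",
                           "Index funds track a market index.")
    publish_after_response(stream, "r2", "Is this fund safe?",
                           "This fund offers guaranteed returns.")
    await stream.join()  # demo only: wait until every event is scored
    worker.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await worker
    return records, review_queue


if __name__ == "__main__":
    all_records, review = asyncio.run(run_demo())
    print(len(all_records), "records;", len(review), "routed to review")
```

The request handler only enqueues the event, so scoring time never adds to the user-facing response time. The `stream.join()` call exists only so the demo can inspect results; a live service would not wait on it.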
Real-world Use Case
- Production quality must be observed continuously, not just measured at release.
- Latency budget on the user path doesn’t allow a blocking judge call.
- Multiple scorer types (LLM judge, programmatic check, embedding similarity) should run side by side.
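As one illustration of a scorer that can run beside an LLM judge and a programmatic check, the sketch below scores reference similarity from two embedding vectors. It assumes the vectors are already computed by some embedding model outside the snippet; the function names and the rescaling of cosine similarity to the range 0 to 1 are choices made for this example, not part of the pattern.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Standard cosine similarity; returns 0.0 if either vector has zero norm.
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reference_similarity_score(output_vec: Sequence[float],
                               reference_vec: Sequence[float]) -> float:
    # Rescale cosine similarity from [-1, 1] to [0, 1] so it can share
    # thresholds and dashboards with other scorers (an assumed convention).
    return (cosine_similarity(output_vec, reference_vec) + 1.0) / 2.0
```

Keeping every scorer on the same score scale makes it simpler to apply one low-score threshold per scorer and to compare trends across scorer types.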
📌 TL;DR
Score outputs async after they reach users — multiple scorer types in parallel, zero request latency, bad outputs routed to a review queue. Observation, not prevention.
Advantages
- Continuous live-traffic quality signal with zero latency cost in the user path.
- Many scorer types run side-by-side without contention.
- Low-score events accumulate in a review queue for later inspection rather than triggering regeneration during the request.
- Cost is bounded by sampling rates per scorer.
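One way to bound cost per scorer is deterministic, hash-based sampling, sketched below. The rates in `SAMPLE_RATES` and the scorer names are hypothetical values for illustration; the idea is that cheap checks run on all traffic while expensive judges run on a small fraction.

```python
import hashlib

# Hypothetical per-scorer sampling rates: fraction of requests each scorer sees.
SAMPLE_RATES = {
    "forbidden_phrases": 1.0,      # cheap programmatic check, score everything
    "embedding_similarity": 0.25,
    "llm_judge": 0.05,             # expensive, score a small sample
}


def should_score(request_id: str, scorer: str, rates: dict = SAMPLE_RATES) -> bool:
    # Hash scorer name and request id to a stable bucket in [0, 1).
    # The same request always gets the same decision, so retries and
    # replays do not change which requests were sampled.
    rate = rates.get(scorer, 0.0)
    digest = hashlib.sha256(f"{scorer}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def estimated_daily_cost(requests_per_day: int, cost_per_call: dict,
                         rates: dict = SAMPLE_RATES) -> dict:
    # Expected spend per scorer: traffic * sampling rate * unit cost.
    return {
        name: requests_per_day * rate * cost_per_call.get(name, 0.0)
        for name, rate in rates.items()
    }
```

Because the scorer name is part of the hash input, each scorer samples a different subset of requests, so the samples are not all concentrated on the same traffic.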
Disadvantages
- Open-loop — the bad output already reached the user; this pattern observes rather than corrects.
- Under traffic spikes, async scorers can fall behind, delaying the quality signal by minutes.
- Judge-model scores drift when the judge model changes version, so the rubric and judge version should be recorded with every score.
- Scorer costs can creep without governance on sampling rates.
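Two of the drawbacks above, lag and drift, can at least be made visible in the score records themselves. The sketch below assumes each record carries a rubric version, a judge model identifier, and two timestamps; these field names and the 120-second alert threshold are hypothetical and not prescribed by the pattern.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class VersionedScore:
    request_id: str
    scorer: str
    score: float
    rubric_version: str  # hypothetical field: which rubric text was used
    judge_model: str     # hypothetical field: which judge model produced the score
    produced_at: float   # epoch seconds when the output reached the user
    scored_at: float     # epoch seconds when the scorer finished


def mean_score_by_version(records):
    # Group by scorer, rubric, and judge model so a version change shows up
    # as a new series instead of silently shifting an existing average.
    groups = defaultdict(list)
    for r in records:
        groups[(r.scorer, r.rubric_version, r.judge_model)].append(r.score)
    return {key: mean(scores) for key, scores in groups.items()}


def max_lag_seconds(records) -> float:
    # Worst-case delay between the user seeing an output and its score existing.
    return max((r.scored_at - r.produced_at for r in records), default=0.0)


def lag_alert(records, threshold_seconds: float = 120.0) -> bool:
    # True when the scoring signal is staler than the allowed threshold.
    return max_lag_seconds(records) > threshold_seconds
```

Splitting averages by version means a drop in score after a judge upgrade can be attributed to the judge rather than mistaken for a regression in the agent.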