Policy-Localizer-Validator
Split a GUI agent into three specialist models — Policy (plans), Localizer (grounds pixels), Validator (checks completion) — each sized to its job.
Intent & Description
🎯 Intent
Attribute failures cleanly and minimize cost by routing each subproblem to the smallest sufficient model.
📋 Context
You’re running a browser or desktop agent through long trajectories. Per-step cost and latency matter. Failures are hard to attribute: is it a bad plan, a bad click, or a wrong ‘done’ signal? You want clean attribution and independently tunable components.
💡 Solution
Run a three-model pipeline at every step. The Policy LLM reads the current screenshot and task state and emits a textual action (“click the Sign In button in the top-right”). The Localizer VLM takes that description plus the screenshot and returns pixel coordinates. The action executes. The Validator VLM then inspects the resulting screenshot and judges whether the task is complete: if uncertain, continue; if confidently complete, halt; if confidently failed, retry or escalate. Each model is sized independently: Policy is the largest, Localizer is a small specialist, Validator is mid-size.
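The per-step flow described above can be sketched as a small control loop. This is a minimal illustration, not a reference implementation: the three models and the environment are passed in as plain callables, and every name here (`run_task`, `Verdict`, `StepRecord`, `max_steps`, `max_retries`) is a hypothetical choice made for the example. The retry budget is also an assumption; the pattern only says "retry or escalate".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple

Point = Tuple[int, int]


class Verdict(Enum):
    COMPLETE = "complete"    # Validator is confident the task is done
    FAILED = "failed"        # Validator is confident the task has failed
    UNCERTAIN = "uncertain"  # anything else: keep going


@dataclass
class StepRecord:
    action_text: str  # what the Policy asked for
    point: Point      # where the Localizer clicked
    verdict: Verdict  # what the Validator concluded afterwards


def run_task(
    task: str,
    observe: Callable[[], object],
    policy: Callable[[str, object, List[StepRecord]], str],
    localizer: Callable[[str, object], Point],
    execute: Callable[[Point], None],
    validator: Callable[[str, object], Verdict],
    max_steps: int = 20,
    max_retries: int = 2,
) -> Tuple[str, List[StepRecord]]:
    """Drive one task; returns (status, per-step log)."""
    history: List[StepRecord] = []
    retries = 0
    for _ in range(max_steps):
        screenshot = observe()
        action_text = policy(task, screenshot, history)  # plan in text
        point = localizer(action_text, screenshot)       # ground to pixels
        execute(point)                                   # act on the GUI
        verdict = validator(task, observe())             # judge the new screen
        history.append(StepRecord(action_text, point, verdict))
        if verdict is Verdict.COMPLETE:
            return "done", history
        if verdict is Verdict.FAILED:
            retries += 1
            if retries > max_retries:
                return "escalate", history
        # UNCERTAIN (or FAILED within the retry budget): take another step
    return "step_limit", history
```

Because each stage writes its output into `StepRecord`, the log keeps the plan, the click, and the completion judgment separate for later inspection.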
Real-world Use Case
- Agent drives a GUI or browser via screenshots and actions across long trajectories.
- Per-step cost matters enough to justify specialized models.
- Failure-mode attribution is needed for debugging or audit.
- Open-weights specialist VLMs are available or trainable for the target domain.
📌 TL;DR
Policy plans, Localizer clicks, Validator says done. Three specialist models, each sized right. Clean failure attribution, lower total cost than one giant model.
Advantages
- Each role uses the smallest sufficient model, so total cost can come in below that of one large model handling every subproblem.
- Failures attribute cleanly: bad plan, bad grounding, or bad commit decision.
- Validator gives a stop signal from a separate model, so it does not inherit the planner’s optimism.
- Specialist VLMs can be trained on open weights without retraining the planner.
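One way to make the attribution claim concrete is to check the stages in pipeline order on a labeled step and blame the first one that went wrong. The sketch below assumes an offline review pass supplies ground truth (whether the plan was right, where the intended target sits on screen, whether the task was truly complete); the `LabeledStep` fields and the `attribute_failure` function are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom (pixels)


@dataclass
class LabeledStep:
    plan_correct: bool             # reviewer: was the textual action the right move?
    target_box: Box                # reviewer: bounding box of the intended element
    point: Tuple[int, int]         # Localizer output
    validator_said_complete: bool  # Validator output
    truly_complete: bool           # reviewer: was the task actually done?


def attribute_failure(step: LabeledStep) -> Optional[str]:
    """Return the first stage that erred, or None if the step was clean."""
    if not step.plan_correct:
        return "policy"      # bad plan
    left, top, right, bottom = step.target_box
    x, y = step.point
    if not (left <= x <= right and top <= y <= bottom):
        return "localizer"   # bad grounding
    if step.validator_said_complete != step.truly_complete:
        return "validator"   # bad commit decision
    return None
```

Checking in pipeline order matters: a wrong plan makes the downstream click and verdict uninformative, so later stages are only judged when the earlier ones were correct.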
Disadvantages
- Three models = three deployment targets, three training pipelines, three versioning surfaces.
- The inter-model interface (textual action description) becomes a contract that must stay stable.
- Validator must be calibrated; otherwise it halts too early or too late.
- Until the Validator is trained on the target domain, completion judgments are weak.
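The calibration concern can be illustrated with explicit thresholds. The sketch below assumes the Validator exposes two scores in [0, 1], one for "task complete" and one for "task failed"; the pattern itself does not specify that interface, and the names `decide`, `stop_error_rates`, `halt_at`, and `fail_at`, as well as the default values, are hypothetical.

```python
from typing import Iterable, Tuple


def decide(p_complete: float, p_failed: float,
           halt_at: float = 0.9, fail_at: float = 0.9) -> str:
    """Map Validator scores to the three outcomes of the pattern."""
    for p in (p_complete, p_failed):
        if not 0.0 <= p <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    if p_complete >= halt_at:
        return "halt"               # confident-complete
    if p_failed >= fail_at:
        return "retry_or_escalate"  # confident-failed
    return "continue"               # uncertain


def stop_error_rates(samples: Iterable[Tuple[float, bool]],
                     halt_at: float) -> Tuple[float, float]:
    """Given (p_complete, truly_complete) pairs from labeled screenshots,
    return (early_stop_rate, missed_stop_rate) at the given threshold."""
    early = not_done = missed = done = 0
    for p_complete, truly_complete in samples:
        if truly_complete:
            done += 1
            missed += p_complete < halt_at   # should have halted, did not
        else:
            not_done += 1
            early += p_complete >= halt_at   # halted on an unfinished task
    early_rate = early / not_done if not_done else 0.0
    missed_rate = missed / done if done else 0.0
    return early_rate, missed_rate
```

Sweeping `halt_at` over a labeled set from the target domain shows the trade-off directly: raising it reduces early stops at the price of more missed stops, and the reverse when lowering it.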