Dual-System GUI Agent
Split a GUI agent into a decision model that plans and a grounding model that clicks — each optimized for its own job.
Intent & Description
🎯 Intent
Route planning and pixel-grounding to separate models that each handle their subproblem well.
📋 Context
You’re running a long, multi-step GUI workflow: filling a multi-page form, booking a ride, confirming payment. You need both flexible high-level replanning (what to do when the form looks different than expected) and pixel-accurate click grounding. A single model doing both tends to underperform on at least one of them.
💡 Solution
Define a clean intermediate vocabulary: the decision model emits high-level intents (“open the cart”, “swipe left to next item”) drawn from a small, typed set. The grounding model receives that intent plus the current screenshot and emits the concrete action (tap coordinates, key press). The decision model holds the plan and replans on failure; the grounding model keeps no state between actions and is specialized for screen interpretation.
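The hand-off described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `IntentKind`, `Intent`, `Action`, and `run_episode` are hypothetical, the three intent kinds are an arbitrary example vocabulary, and the two models are represented as plain callables that you would back with real model calls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class IntentKind(Enum):
    """Hypothetical closed vocabulary shared by both models."""
    TAP = "tap"
    SWIPE = "swipe"
    TYPE_TEXT = "type_text"


@dataclass(frozen=True)
class Intent:
    """What the decision model emits: what to do, not where."""
    kind: IntentKind
    target: str                 # e.g. "the cart icon"
    text: Optional[str] = None  # only used for TYPE_TEXT


@dataclass(frozen=True)
class Action:
    """What the grounding model emits: a concrete, executable action."""
    kind: IntentKind
    x: int
    y: int
    text: Optional[str] = None


History = List[Tuple[Intent, bool]]
# Assumed interfaces: the decision model sees the goal and past outcomes;
# the grounding model sees only one intent and one screenshot.
DecisionModel = Callable[[str, History], Optional[Intent]]
GroundingModel = Callable[[Intent, bytes], Optional[Action]]


def run_episode(
    goal: str,
    decide: DecisionModel,
    ground: GroundingModel,
    screenshot: Callable[[], bytes],
    execute: Callable[[Action], bool],
    max_steps: int = 20,
) -> History:
    """Alternate decision and grounding calls until the plan is done."""
    history: History = []
    for _ in range(max_steps):
        intent = decide(goal, history)  # the plan lives here; may replan
        if intent is None:              # decision model signals completion
            break
        action = ground(intent, screenshot())  # no memory of earlier steps
        ok = action is not None and execute(action)
        history.append((intent, ok))    # failures feed the next decision
    return history
```

Only the decision model sees `history`, so a failed grounding attempt shows up as a `False` entry that the next `decide` call can react to, while `ground` stays a pure function of the intent and the current screenshot.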
Real-world Use Case
- A single GUI model is dominated by either planning or grounding and underperforms on the other.
- The decisions can be expressed in a clean intermediate vocabulary that the grounding model consumes.
- Two specialized models are available and routing between them is feasible.
📌 TL;DR
Decision model plans. Grounding model clicks. Two specialized models working in sequence beat one generalist doing both.
Advantages
- Each model is sized to its skill, so the combined parameter count can be smaller than that of a unified model of comparable capability.
- Failure attribution is clean: planning problem vs. grounding problem.
- Decision-model planning generalizes across desktop, web, and mobile; grounding model is per-surface.
Disadvantages
- Two sequential model calls per turn, so per-turn latency and cost are the sum of both calls (roughly double when the two models are similar in size).
- The intermediate intent vocabulary is a real design problem; bad vocabulary = broken hand-off.
- Hand-off mistakes (decision says X, grounding hears Y) are hard to debug.
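One way to soften the last two disadvantages is to validate every intent at the boundary and keep a record of each hand-off. The sketch below assumes intents are passed as JSON. The `ALLOWED` table and the functions `validate_intent` and `record_handoff` are hypothetical names used for illustration, and the field sets are example choices, not a prescribed schema.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical vocabulary: each intent kind and the exact fields it requires.
ALLOWED = {
    "tap": {"target"},
    "swipe": {"target", "direction"},
    "type_text": {"target", "text"},
}


def validate_intent(raw: str) -> Dict[str, Any]:
    """Parse the decision model's output and reject anything off-vocabulary."""
    intent = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(intent, dict):
        raise ValueError("intent must be a JSON object")
    kind = intent.get("kind")
    if kind not in ALLOWED:
        raise ValueError(f"unknown intent kind: {kind!r}")
    fields = set(intent) - {"kind"}
    if fields != ALLOWED[kind]:
        raise ValueError(
            f"{kind} expects fields {sorted(ALLOWED[kind])}, got {sorted(fields)}"
        )
    return intent


def record_handoff(
    log: List[Dict[str, Any]],
    step: int,
    intent: Dict[str, Any],
    action: Optional[Dict[str, Any]],
) -> None:
    """Store what the decision model said next to what grounding produced."""
    log.append({"step": step, "intent": intent, "action": action})
```

Rejecting malformed intents before they reach the grounding model turns a silent mismatch into an explicit error on the decision side, and the per-step log lets you compare "decision said X" with "grounding did Y" after the fact.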