Computer Use
Let the model drive a desktop end-to-end via screenshots and virtual mouse/keyboard — no bespoke per-app APIs needed.
Intent & Description
🎯 Intent
Control any GUI app the same way a human would — through the screen.
📋 Context
You need an agent to operate a legacy accounting suite, internal CRM, or custom Windows utility that has no public API and no plugin hooks. The agent has to work the same screen, mouse, and keyboard a human would.
💡 Solution
The model receives screenshots (optionally with accessibility-tree or set-of-mark annotations) and emits typed tool calls (move mouse, click, type, scroll, screenshot). A controller executes them against a real or virtual desktop. The loop is ReAct-shaped: screenshot → think → act → screenshot.
Real-world Use Case
- The target software has no clean API and the agent must drive it visually.
- Screenshots plus virtual mouse/keyboard tool calls fit the environment.
- The model vendor exposes sufficient screen-grounding capability.
Source
📌 TL;DR
Computer Use = screenshot → think → click. Works on any GUI with no API. Slow and injection-prone, but the only option when nothing else exists.
Advantages
- Universal coverage — if a human can use it, the agent can use it.
- Zero per-app integration work; no API contracts to maintain.
Disadvantages
- Slow and brittle on dynamic UIs where layout shifts between actions.
- Screen content is now part of the prompt — prompt injection via on-screen text becomes a real attack surface.