Mobile UI Agent
Drive a smartphone end-to-end through a touch-native action vocabulary (tap, swipe, type, back, home) — purpose-built for mobile, not a desktop agent bolt-on.
Intent & Description
🎯 Intent
Operate mobile apps on real or emulated phones using the same touch interface a human uses.
📋 Context
You need an agent to operate a ride-hailing app, food delivery app, banking app, or super-app on a phone. The app offers no public API and no clean web frontend, so the only surface is the touch UI itself.
💡 Solution
Define a touch-native action vocabulary: tap(x,y), long_press(x,y), swipe(dir), type(text), back, home. On each step the agent receives a screenshot (optionally with extracted UI element annotations), reasons in text about which element to act on, emits one action call, and observes the next screenshot. Specialize the vocabulary per platform (Android vs iOS) but keep the agent loop platform-agnostic.
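The vocabulary and loop described above can be sketched in a few dozen lines of Python. This is a minimal illustration, not a reference implementation: the names `Action`, `Device`, `run_agent`, and `android_input_args` are hypothetical, the `done` action is an added stop signal that is not part of the vocabulary, and `policy` stands in for whatever model call picks the next action. The Android translation assumes actions are sent through `adb shell input`, which supports `tap`, `swipe`, `text`, and `keyevent`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Protocol

KINDS = {"tap", "long_press", "swipe", "type", "back", "home", "done"}


@dataclass(frozen=True)
class Action:
    """One call in the touch-native vocabulary ("done" is an added stop signal)."""
    kind: str
    x: Optional[int] = None          # pixel coordinates for tap / long_press
    y: Optional[int] = None
    direction: Optional[str] = None  # "up" | "down" | "left" | "right" for swipe
    text: Optional[str] = None       # payload for type

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown action kind: {self.kind!r}")


class Device(Protocol):
    """Platform-specific half: an Android or iOS backend would implement this."""
    def screenshot(self) -> bytes: ...
    def perform(self, action: Action) -> None: ...


def android_input_args(action: Action, width: int, height: int) -> List[str]:
    """Translate an Action into arguments for Android's `adb shell input`."""
    cx, cy = width // 2, height // 2
    if action.kind == "tap":
        return ["tap", str(action.x), str(action.y)]
    if action.kind == "long_press":
        # Common workaround: a swipe that stays in place for 800 ms.
        x, y = str(action.x), str(action.y)
        return ["swipe", x, y, x, y, "800"]
    if action.kind == "swipe":
        steps = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
        dx, dy = steps[action.direction]
        # Assumption: the finger moves a quarter of the screen from the center.
        end_x, end_y = cx + dx * width // 4, cy + dy * height // 4
        return ["swipe", str(cx), str(cy), str(end_x), str(end_y)]
    if action.kind == "type":
        # `input text` reads %s as a space; shell metacharacters are not handled.
        return ["text", (action.text or "").replace(" ", "%s")]
    if action.kind == "back":
        return ["keyevent", "KEYCODE_BACK"]
    if action.kind == "home":
        return ["keyevent", "KEYCODE_HOME"]
    raise ValueError(f"{action.kind!r} has no device-level equivalent")


Policy = Callable[[bytes, List[Action]], Action]


def run_agent(device: Device, policy: Policy, max_steps: int = 20) -> List[Action]:
    """Platform-agnostic loop: observe a screenshot, decide, act, repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = policy(device.screenshot(), history)
        if action.kind == "done":
            break
        device.perform(action)
        history.append(action)
    return history
```

An Android `Device.perform` could pass the returned list to `subprocess.run(["adb", "shell", "input", *args])`; an iOS backend would implement the same two methods with its own automation tooling, leaving `run_agent` unchanged.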
Real-world Use Case
- The target environment is a smartphone where touch is the only useful input surface.
- Desktop Computer Use or Browser Agent action sets are the wrong shape for the task.
- A small touch-native vocabulary (tap, swipe, type, back, home) covers the workflow.
📌 TL;DR
Touch-native vocab (tap, swipe, type) + screenshots = mobile agent. Works on any app you can see. Watch out for coordinate brittleness and accidental payments.
Advantages
- Works against any app whose UI is visible — including third-party apps with no APIs.
- Single agent loop generalizes across apps once the vocabulary is fixed.
- Vision input plus a small action set keeps the model's output space small and tractable.
Disadvantages
- Coordinate-based taps are brittle to screen size, theme, or locale changes.
- Pure-vision grounding often selects the wrong element; element-annotation pipelines help but add complexity.
- Sensitive actions (payments, deletions) are easy to trigger by accident.
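Two of the drawbacks above have simple partial mitigations, sketched here under stated assumptions. Having the model emit normalized coordinates in [0, 1] and converting them per device addresses screen-size brittleness only, not theme or locale changes. Gating taps on a confirmation callback guards against accidental payments or deletions, assuming an element label is available from the annotation pipeline. All names (`to_pixels`, `is_sensitive`, `guarded_tap`, `SENSITIVE_KEYWORDS`) and the keyword list are hypothetical and would need tuning per app and language.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical keyword list; a real deployment would tune this per app and locale.
SENSITIVE_KEYWORDS = ("pay", "purchase", "place order", "delete", "transfer")


def to_pixels(nx: float, ny: float, width: int, height: int) -> Tuple[int, int]:
    """Map normalized [0, 1] coordinates to on-screen pixel coordinates."""
    if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
        raise ValueError(f"normalized coordinates out of range: ({nx}, {ny})")
    # Clamp so that 1.0 maps to the last valid pixel rather than one past it.
    return min(int(nx * width), width - 1), min(int(ny * height), height - 1)


def is_sensitive(label: str, keywords: Sequence[str] = SENSITIVE_KEYWORDS) -> bool:
    """Crude substring check on the target element's label."""
    lowered = label.lower()
    return any(keyword in lowered for keyword in keywords)


def guarded_tap(
    label: str,
    nx: float,
    ny: float,
    width: int,
    height: int,
    tap: Callable[[int, int], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Tap the element, asking `confirm` first when the label looks sensitive.

    Returns True if the tap was performed, False if confirmation was refused.
    """
    if is_sensitive(label) and not confirm(label):
        return False
    tap(*to_pixels(nx, ny, width, height))
    return True
```

The `confirm` callback is where a human approval prompt or a stricter policy check would plug in; keeping it outside the model means a grounding mistake alone cannot complete a payment.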