World Model as Tool
Let the planning agent call a generative world model (a video diffusion model or physics simulator) as a tool to preview the consequences of an action before committing to it: lookahead without acting.
Intent & Description
🎯 Intent
Ground planning decisions in simulated rollouts for actions whose physical consequences are hard to reason about in text.
📋 Context
Your planning agent operates in an environment with physics, geometry, or rich perceptual dynamics — a household robot, game agent, or control system. Some actions are irreversible or expensive. A capable generative world model (video diffusion, learned dynamics, external simulator) exists and can produce plausible rollouts.
💡 Solution
Register the generative world model behind a tool interface. The input is a structured current state plus a candidate action sequence; the output is a generated rollout (video frames, a simulated trajectory, or predicted observations) plus an optional uncertainty estimate. The agent calls this tool before committing to any irreversible or expensive action, compares the predicted rollouts across candidates, and uses simulator agreement as a gate: it proceeds only when the rollouts support the chosen action. Treat the world model as fallible; its output is evidence, not truth.
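The gating loop described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Rollout` fields, the `WorldModelTool` signature, the thresholds, and `stub_simulate` are all hypothetical names chosen for the example. A real tool would wrap a video diffusion model or simulator and would need its own way to score a rollout against the goal.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Rollout:
    """Hypothetical summary of one generated rollout."""
    success_score: float  # assumed score in [0, 1]: how well the rollout meets the goal
    uncertainty: float    # assumed self-reported uncertainty in [0, 1]


# Assumed tool signature: (state, action_sequence) -> Rollout
WorldModelTool = Callable[[dict, Sequence[str]], Rollout]


def choose_plan(
    state: dict,
    candidates: Sequence[Sequence[str]],
    simulate: WorldModelTool,
    n_samples: int = 3,
    min_score: float = 0.8,
    max_spread: float = 0.15,
    max_uncertainty: float = 0.3,
) -> Optional[Sequence[str]]:
    """Return the best candidate that passes the gate, or None (do not act)."""
    best_plan, best_mean = None, -1.0
    for plan in candidates:
        # Sample several rollouts so disagreement between them becomes visible.
        rollouts = [simulate(state, plan) for _ in range(n_samples)]
        scores = [r.success_score for r in rollouts]
        mean = sum(scores) / len(scores)
        spread = max(scores) - min(scores)
        worst_uncertainty = max(r.uncertainty for r in rollouts)
        # Gate: rollouts must look good, agree with each other, and be confident.
        passes_gate = (
            mean >= min_score
            and spread <= max_spread
            and worst_uncertainty <= max_uncertainty
        )
        if passes_gate and mean > best_mean:
            best_plan, best_mean = plan, mean
    return best_plan


def stub_simulate(state: dict, plan: Sequence[str]) -> Rollout:
    """Stand-in for a real world model; scores are made up for the demo."""
    risky = "push_glass" in plan
    return Rollout(success_score=0.3 if risky else 0.9, uncertainty=0.1)


candidates = [["push_glass"], ["grasp_cup", "lift_cup"]]
chosen = choose_plan({"cup": "on_table"}, candidates, stub_simulate)
print(chosen)
```

Returning `None` when no candidate passes keeps the rollout in the role of evidence: the agent can fall back to asking for help, replanning, or taking a cheaper reversible action instead of trusting a weak prediction.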
Real-world Use Case
- Actions have physical or perceptual consequences the agent can’t reliably reason about in text.
- A capable generative world model is available as an external service or local model.
- Some actions are irreversible enough that even a noisy lookahead pays for itself.
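One rough way to check the last condition is a back-of-envelope expected-cost comparison. The sketch below assumes a deliberately simple model: the lookahead catches a bad outcome with some fixed detection rate, and false alarms are ignored. All parameter names and numbers are illustrative assumptions, not measurements.

```python
def lookahead_pays_off(
    p_bad: float,        # assumed probability the action goes wrong
    cost_bad: float,     # assumed cost of an irreversible bad outcome
    detect_rate: float,  # assumed fraction of bad outcomes the rollout flags
    call_cost: float,    # assumed cost of one world-model call, same units
) -> bool:
    """True when the expected loss avoided exceeds the price of the call."""
    expected_saving = p_bad * detect_rate * cost_bad
    return expected_saving > call_cost


# A costly, irreversible action: a noisy lookahead is still worth calling.
print(lookahead_pays_off(p_bad=0.05, cost_bad=200.0, detect_rate=0.6, call_cost=2.0))
# A cheap, recoverable action: the call costs more than it saves.
print(lookahead_pays_off(p_bad=0.05, cost_bad=20.0, detect_rate=0.6, call_cost=2.0))
```

Under these assumed numbers the first case avoids about 6 units of expected loss for a 2-unit call, while the second avoids only about 0.6, which is why the pattern targets irreversible or expensive actions rather than every planning step.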
📌 TL;DR
World model as a tool = simulate before you act. Call it for irreversible actions, compare rollouts, treat output as evidence not truth. Slow and expensive but worth it when stakes are high.
Advantages
- Foresight grounded in a real generative simulator, not just text reasoning.
- Decouples the agent from any one world model — swap the tool when a better one ships.
- Rollouts are inspectable artifacts (video, trajectory) — useful for debugging and post-hoc review.
Disadvantages
- Generative world models are slow and expensive to call per planning step.
- Rollouts hallucinate — treating them as ground truth introduces a new failure mode.
- Encoding state and action well enough for the world model is non-trivial.
- Aggregating noisy rollouts with text reasoning is an open design problem.