RL-Trained Conductor Orchestrator
A small RL-trained conductor sits in front of a pool of frontier LLM workers and learns from task-outcome rewards, rather than from hand-written routing rules, which worker to call for which subtask.
Intent & Description
🎯 Intent
Train a small meta-model with reinforcement learning to dynamically dispatch sub-tasks across a pool of frontier LLM workers, learning the communication topology end-to-end rather than hard-coding routing.
📋 Context
A production multi-agent stack dispatches subtasks across a heterogeneous pool of frontier LLMs from different vendors: one strong at long-context summarization, one at code synthesis, one at image understanding. The routing logic is a set of hand-written if-then rules that cannot keep up as the vendor pool changes and tasks span more domains.
💡 Solution
A small conductor model sits in front of a pool of worker LLMs and tools. On each step the conductor emits a natural-language subtask instruction and a worker selection; the selected worker runs, its output is returned to the conductor, and the conductor decides the next move. The conductor is trained with RL against final task rewards: it learns which workers handle which subtask shapes, how to phrase the handoff, when to stop, and when to recursively dispatch a subtask back to itself. Workers remain frozen frontier models; only the conductor is trained.
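The control loop described above can be sketched roughly as follows. This is a minimal illustration, not an implementation from the source: `Action`, `run_conductor`, the `"STOP"` sentinel, `max_steps`, and the stub policy and workers are all hypothetical names. In a real system the policy would be the trained conductor model and each worker would wrap a call to a frozen frontier model or tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical and exist only for illustration.

@dataclass
class Action:
    worker: str       # name of the selected worker, or "STOP" to finish
    instruction: str  # subtask instruction (or the final answer on STOP)

History = List[Tuple[Action, str]]
Policy = Callable[[str, History], Action]
Worker = Callable[[str], str]

def run_conductor(task: str, policy: Policy, workers: Dict[str, Worker],
                  max_steps: int = 8) -> Tuple[str, History]:
    """Roll out one episode: the policy picks a worker and an instruction,
    the worker runs, and its output is appended to the history."""
    history: History = []
    for _ in range(max_steps):
        action = policy(task, history)
        if action.worker == "STOP":
            return action.instruction, history
        output = workers[action.worker](action.instruction)
        history.append((action, output))
    # Step budget exhausted: fall back to the last worker output, if any.
    return (history[-1][1] if history else ""), history

def stub_policy(task: str, history: History) -> Action:
    """Stand-in for the trained conductor: a fixed two-step plan."""
    if not history:
        return Action("coder", f"Write code for: {task}")
    if len(history) == 1:
        return Action("summarizer", f"Summarize: {history[-1][1]}")
    return Action("STOP", history[-1][1])

# Stand-ins for frozen frontier-model workers.
stub_workers: Dict[str, Worker] = {
    "coder": lambda instruction: f"[code] {instruction}",
    "summarizer": lambda instruction: f"[summary] {instruction}",
}

if __name__ == "__main__":
    answer, trace = run_conductor("parse a CSV file", stub_policy, stub_workers)
    print(answer)
    print([action.worker for action, _ in trace])
```

The full `history` is returned alongside the answer because an RL trainer would need the whole trajectory of actions, not just the final output, to assign credit from the final task reward.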
Real-world Use Case
- A heterogeneous frontier-model worker pool is in production and routing quality materially affects outcomes.
- Task-outcome rewards are observable at scale.
- An RL training pipeline (or a partner who provides one) is available.
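To make the reward and training prerequisites above concrete, here is a toy sketch of learning a routing policy from outcome rewards alone. It uses a REINFORCE-style update with a running baseline on a tabular softmax policy, chosen only because it is compact; the source does not say which RL algorithm is used. The subtask types, worker names, and simulated success probabilities (`HIDDEN_BEST`, `success_prob`) are assumptions standing in for real task outcomes.

```python
import math
import random
from typing import Dict, List

SUBTASKS = ["summarize", "code", "image"]
WORKERS = ["worker_a", "worker_b", "worker_c"]

# Assumed ground truth, hidden from the policy: which worker suits which subtask.
HIDDEN_BEST = {"summarize": "worker_a", "code": "worker_b", "image": "worker_c"}

def success_prob(subtask: str, worker: str) -> float:
    """Simulated chance that the task succeeds (an assumption for this toy)."""
    return 0.9 if HIDDEN_BEST[subtask] == worker else 0.2

def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps: int = 6000, lr: float = 0.1, seed: int = 0) -> Dict[str, str]:
    """Learn one softmax routing policy per subtask type from 0/1 rewards."""
    rng = random.Random(seed)
    logits = {s: [0.0] * len(WORKERS) for s in SUBTASKS}
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(steps):
        subtask = rng.choice(SUBTASKS)
        probs = softmax(logits[subtask])
        choice = rng.choices(range(len(WORKERS)), weights=probs)[0]
        reward = 1.0 if rng.random() < success_prob(subtask, WORKERS[choice]) else 0.0
        advantage = reward - baseline
        baseline += 0.01 * (reward - baseline)
        # Gradient of log pi(choice) w.r.t. logit i is 1[i == choice] - probs[i].
        for i in range(len(WORKERS)):
            indicator = 1.0 if i == choice else 0.0
            logits[subtask][i] += lr * advantage * (indicator - probs[i])
    # Report the greedy route for each subtask type.
    return {
        s: WORKERS[max(range(len(WORKERS)), key=lambda i: logits[s][i])]
        for s in SUBTASKS
    }

if __name__ == "__main__":
    print(train())
```

On this toy problem the learned greedy routes should match the hidden mapping. A real conductor differs in scale rather than in kind: its state is the task and dispatch history, its action includes a free-text instruction, and the reward arrives only at the end of a multi-step episode.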
Source
📌 TL;DR
A small RL-trained conductor learns which frontier LLM worker to call for each subtask shape. The premise is that routing learned from experience beats hand-written if-then rules when the worker pool keeps changing.
Advantages
- Routing improves from experience instead of hand-editing rules on each model release.
- A cheap meta-model sits on the hot path; frontier models are called only when the conductor selects them as workers.
- Recursive self-dispatch handles decomposable subtasks without a separate planner agent.
- Worker pool churn is absorbed by retraining the conductor, not rewriting routing logic.
Disadvantages
- Requires a reward signal and an RL training pipeline, which most teams don’t have in-house.
- Conductor policy can be opaque; a learned routing tree is harder to audit than a written one.
- Recursive self-dispatch needs strict depth and budget caps or it can fan out aggressively.
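- Worker drift (a vendor updates a model) silently changes the policy’s effective action semantics: the same action now invokes different behavior, so routing quality can degrade with no change to the conductor.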
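The depth and budget caps mentioned in the list above could be enforced along these lines. This is a sketch under stated assumptions: `Budget`, `BudgetExceeded`, and the toy `dispatch` function (which "decomposes" a task by splitting on the word "and") are hypothetical, and a real conductor would decide for itself when to recurse.

```python
from dataclasses import dataclass

class BudgetExceeded(RuntimeError):
    """Raised when a dispatch would break the depth or call cap."""

@dataclass
class Budget:
    max_depth: int = 3   # deepest allowed level of self-dispatch
    max_calls: int = 20  # total dispatches allowed across the whole tree
    calls: int = 0       # dispatches charged so far

    def charge(self, depth: int) -> None:
        if depth > self.max_depth:
            raise BudgetExceeded(f"depth {depth} exceeds cap {self.max_depth}")
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"call budget of {self.max_calls} exhausted")
        self.calls += 1

def dispatch(task: str, budget: Budget, depth: int = 0) -> str:
    """Toy conductor: split on ' and ' and dispatch each part to itself."""
    budget.charge(depth)  # every dispatch is charged before any work happens
    parts = task.split(" and ", 1)
    if len(parts) == 1:
        return f"done({task})"  # leaf subtask: stands in for a worker call
    return " + ".join(dispatch(part, budget, depth + 1) for part in parts)

if __name__ == "__main__":
    shared = Budget()
    print(dispatch("a and b and c", shared), shared.calls)
    try:
        dispatch("a and b and c", Budget(max_depth=1))
    except BudgetExceeded as err:
        print("stopped:", err)
```

The single `Budget` object is passed by reference through every recursive call, so the call cap bounds the whole dispatch tree rather than each branch separately. That is the property that keeps fan-out in check.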