Preference-Uncertain Agent
Agent treats its own reward/objective as a hidden variable to be inferred from human behaviour, not a fixed target.
Intent & Description
🎯 Intent
The agent treats its own reward or objective as a hidden variable to be inferred from human behaviour, not as a fixed target.
📋 Context
An LLM agent is given an objective by prompt or by fine-tuning. In Russell’s framing, the prompt is at best an observation about what the designer wants, not the underlying preference itself. Treating the prompt as the ground-truth reward is a category error, and its cost compounds over long-horizon deployments.
💡 Solution
Pose the agent’s planning problem as expected-utility maximisation under a posterior over rewards, not under a known reward. Update the posterior from corrections, demonstrations, and explicit feedback, and expose a summary of it in traces. Downstream patterns (off-switch incentive, soft-optimization cap, cooperative preference inference) build on top of it. This is distinct from confidence calibration on outputs: it is calibration on the objective itself.
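The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the two reward hypotheses, their utility numbers, the action names, and the Boltzmann-rational likelihood with parameter `beta` are all assumptions made for the example. A real LLM agent would need a far richer hypothesis space.

```python
import math

# Hypothetical reward hypotheses: utility of each action under each
# candidate objective. All names and numbers are illustrative assumptions.
HYPOTHESES = {
    "wants_speed":  {"ship_now": 1.0,  "add_tests": 0.3},
    "wants_safety": {"ship_now": -1.0, "add_tests": 0.8},
}

def normalise(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def update(posterior, chosen, rejected, beta=2.0):
    """Bayes update after a human prefers `chosen` over `rejected`.

    Assumed likelihood (Boltzmann-rational choice model):
    P(chosen | h) = sigmoid(beta * (u_h(chosen) - u_h(rejected))).
    """
    unnormalised = {}
    for h, p in posterior.items():
        gap = HYPOTHESES[h][chosen] - HYPOTHESES[h][rejected]
        likelihood = 1.0 / (1.0 + math.exp(-beta * gap))
        unnormalised[h] = p * likelihood
    return normalise(unnormalised)

def expected_utility(posterior, action):
    """Utility of `action` averaged over the reward posterior."""
    return sum(p * HYPOTHESES[h][action] for h, p in posterior.items())

def best_action(posterior, actions):
    """Pick the action that maximises posterior-expected utility."""
    return max(actions, key=lambda a: expected_utility(posterior, a))

actions = ["ship_now", "add_tests"]
prior = {"wants_speed": 0.8, "wants_safety": 0.2}

# A human correction: they would rather the agent had added tests.
posterior = update(prior, chosen="add_tests", rejected="ship_now")
```

Under these made-up numbers, the prior favours `ship_now`, and a single correction shifts enough mass toward `wants_safety` that `best_action` switches to `add_tests`. The prompt-derived prior is treated as evidence that later feedback can outweigh, never as the final word.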
Real-world Use Case
- Long-horizon deployments where the objective is unlikely to be fully specifiable up front.
- Stakes high enough that quietly mis-optimising a proxy is catastrophic.
- The team has the engineering capacity to maintain and update a reward posterior.
Advantages
- Deference, asking, and pausing become principled moves.
- Composes with off-switch incentive and soft-optimization cap.
- Surfaces alignment as ongoing inference, not a one-shot fine-tune.
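One way to see why asking becomes a principled move is to compare acting now against asking first, using the expected value of information. The sketch below assumes a hypothetical two-hypothesis reward posterior, made-up utilities, a human answer that fully reveals the true objective, and a fixed `ask_cost`. All of these are simplifications chosen for illustration.

```python
# Hypothetical utilities: UTILITIES[hypothesis][action]; illustrative only.
UTILITIES = {
    "wants_speed":  {"ship_now": 1.0,  "add_tests": 0.3},
    "wants_safety": {"ship_now": -1.0, "add_tests": 0.8},
}
ACTIONS = ["ship_now", "add_tests"]

def expected_utility(posterior, action):
    """Utility of `action` averaged over the reward posterior."""
    return sum(p * UTILITIES[h][action] for h, p in posterior.items())

def value_of_asking(posterior):
    """Expected gain from learning the true hypothesis before acting.

    Assumes the human's answer is perfectly informative.
    """
    # Best the agent can do if it commits to one action right now.
    act_now = max(expected_utility(posterior, a) for a in ACTIONS)
    # Best it can do on average if it learns the hypothesis first.
    informed = sum(
        p * max(UTILITIES[h][a] for a in ACTIONS)
        for h, p in posterior.items()
    )
    return informed - act_now

def decide(posterior, ask_cost=0.1):
    """Ask when the information is worth more than the interruption."""
    if value_of_asking(posterior) > ask_cost:
        return "ask"
    return max(ACTIONS, key=lambda a: expected_utility(posterior, a))

uncertain = {"wants_speed": 0.8, "wants_safety": 0.2}
confident = {"wants_speed": 0.99, "wants_safety": 0.01}
```

With these numbers `decide(uncertain)` returns `"ask"` and `decide(confident)` returns `"ship_now"`. The assumed `ask_cost` is the tuning knob: set it too low and the agent asks about everything, set it too high and it never defers.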
Disadvantages
- Maintaining a reward posterior for LLM agents is research-grade engineering.
- Over-uncertain agents are paralysed; under-uncertain agents revert to the failure modes of optimising a fixed proxy objective.
- Posterior summarisation in traces is itself non-trivial; principals may not interpret it correctly.
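As a rough illustration of what a posterior summary in a trace might look like, the sketch below reduces a discrete reward posterior to its top hypotheses plus an entropy figure and serialises it as one JSON line. The field names, the `reward_posterior` event label, and the choice of entropy as the uncertainty measure are assumptions for this example, not an established trace schema.

```python
import json
import math

def summarise_posterior(posterior, top_k=2):
    """Compress a discrete reward posterior into a small trace-friendly dict."""
    # Shannon entropy in bits: 0 means certain, log2(n) means uniform.
    entropy = -sum(p * math.log2(p) for p in posterior.values() if p > 0)
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "top_hypotheses": [
            {"name": h, "prob": round(p, 3)} for h, p in ranked[:top_k]
        ],
        "entropy_bits": round(entropy, 3),
        "max_entropy_bits": round(math.log2(len(posterior)), 3),
    }

# Hypothetical posterior over three candidate objectives.
posterior = {"wants_speed": 0.45, "wants_safety": 0.5, "wants_low_cost": 0.05}
summary = summarise_posterior(posterior)

# One JSON line per update keeps the summary easy to grep and to diff.
trace_line = json.dumps({"event": "reward_posterior", **summary})
```

Reporting entropy alongside its maximum gives principals a scale for the number, though it does not by itself solve the interpretation problem noted above.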