Soft-Optimization Cap
Intent & Description
Intent
Cap how strongly the agent optimises its inferred objective: sample from the top quantile of acceptable actions rather than taking the argmax, or stop improving once the objective is good enough.
Context
An agent’s planner can produce a range of actions scored by the objective. The naïve choice is argmax: pick the highest-scoring action. On a Russell-aligned reading, argmax fully exploits whatever specification gap exists between the inferred objective and the true preference, and leaves no headroom for human correction.
Solution
Following Taylor’s quantilizers: define a base distribution over actions (the agent’s prior over reasonable moves). To pick an action, rank actions by the inferred objective and sample from the base distribution restricted to the top q fraction of its probability mass. The classic bound: a q-quantilizer’s expected cost under any non-negative cost function is at most 1/q times the expected cost of the base distribution. In practice for LLM agents: use top-k sampling on the planner, or set a satisficing threshold and accept the first action that clears it. The cap (q, k, or the threshold) is a parameter to be tuned, not a quantity the agent optimises.
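The two selection rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes a finite action set with explicit base probabilities, and the names `quantilize`, `satisfice`, `plans`, and `proxy` are hypothetical.

```python
import random
from typing import Callable, Iterable, Optional, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    actions: Sequence[A],
    base_probs: Sequence[float],
    score: Callable[[A], float],
    q: float,
    rng: Optional[random.Random] = None,
) -> A:
    """Sample from the base distribution restricted to its top-q mass by score."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not actions or len(actions) != len(base_probs):
        raise ValueError("actions and base_probs must be non-empty and equal length")
    total = float(sum(base_probs))
    if total <= 0.0:
        raise ValueError("base_probs must have positive total mass")
    rng = rng or random.Random()

    # Rank actions from best to worst under the inferred objective.
    ranked = sorted(zip(actions, base_probs), key=lambda ap: score(ap[0]), reverse=True)

    kept, weights, remaining = [], [], q
    for action, p in ranked:
        if remaining <= 1e-12:
            break
        # The action on the quantile boundary is included only partially.
        mass = min(p / total, remaining)
        if mass > 0.0:
            kept.append(action)
            weights.append(mass)
        remaining -= mass

    # Within the kept set, sampling stays proportional to the base distribution.
    return rng.choices(kept, weights=weights, k=1)[0]


def satisfice(
    candidates: Iterable[A], score: Callable[[A], float], threshold: float
) -> Optional[A]:
    """Return the first candidate whose score clears the threshold, else None."""
    for action in candidates:
        if score(action) >= threshold:
            return action
    return None


# Hypothetical toy example: four candidate plans, uniform base distribution.
plans = ["ask_user", "small_fix", "refactor", "rewrite_everything"]
proxy = {"ask_user": 0.2, "small_fix": 0.6, "refactor": 0.7, "rewrite_everything": 0.95}
choice = quantilize(plans, [0.25] * 4, lambda a: proxy[a], q=0.5, rng=random.Random(0))
print(choice)  # one of "rewrite_everything" or "refactor"
```

For an LLM planner with no explicit probabilities, one plausible approximation is to draw n samples from the planner, pass them as `actions` with uniform weights, and let `quantilize` pick among the best-scoring q fraction of those samples.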
Real-world Use Case
- The agent’s inferred objective is plausibly mis-specified at the tail.
- A reasonable base distribution of human-endorsed actions exists.
- Some loss of expected score is acceptable in exchange for tail safety.
Advantages
- Bounds the expected cost of specification gaming, with q as a tunable knob.
- Composes with preference-uncertain and risk-averse patterns.
- Operationally simple: a top-k sampler or a satisficing threshold is easy to implement.
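The bounded-cost advantage follows from a short argument, sketched here for a finite action set. The symbols are assumptions introduced for illustration: γ is the base distribution, c is an unknown non-negative cost function, and Q_q is the q-quantilizer's action distribution.

```latex
% Q_q is \gamma restricted to a set of base mass q and renormalised,
% so no action's probability is inflated by more than a factor 1/q:
Q_q(a) \le \frac{\gamma(a)}{q} \quad \text{for every action } a.

% Since c(a) \ge 0, summing over actions gives the bound:
\mathbb{E}_{a \sim Q_q}[c(a)]
  = \sum_a Q_q(a)\, c(a)
  \le \frac{1}{q} \sum_a \gamma(a)\, c(a)
  = \frac{1}{q}\, \mathbb{E}_{a \sim \gamma}[c(a)].
```

For example, with q = 0.1 the quantilizer's expected cost is at most 10 times that of sampling directly from the base distribution, whatever the true cost function turns out to be.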
Disadvantages
- Capping gives up some expected score when the objective is in fact well aligned.
- The base distribution itself must be reasonable: quantilizing over a bad base does not help.
- Tuning q is a judgment call without a clear principled answer.