Risk-Averse Reward Proxy
Intent & Description
🎯 Intent
When operating outside the distribution the reward was designed for, treat the specified objective as a noisy proxy and plan conservatively across plausible true objectives.
📋 Context
An agent’s reward (a prompt, scoring function, or fine-tuning signal) was designed against a specific training or testing distribution. The agent now operates in a novel situation: a new domain, a new user type, or a new task shape. The reward still scores outputs, but those scores no longer reliably reflect what the designer would have wanted in the novel context.
💡 Solution
Following Inverse Reward Design, treat the designed reward as an observation about the true reward under the design distribution. In a novel context, maintain a set (or posterior) of true rewards consistent with that observation. Plan risk-aversely over that set: prefer actions whose worst-case (or low-quantile) value across the plausible true rewards is acceptable, rather than actions that maximise expected value under the literal proxy. This directly mitigates specification gaming under deployment shift.
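The risk-averse planning step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation of Inverse Reward Design: the feature names (`progress`, `shortcut`), the three candidate weights, and the helpers `make_reward`, `low_quantile`, and `risk_averse_choice` are all hypothetical, and the plausible-reward set is simply given rather than inferred.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]
Reward = Callable[[Features], float]


def make_reward(shortcut_weight: float) -> Reward:
    """Hypothetical candidate reward: task progress plus a weight on a novel feature."""
    def reward(phi: Features) -> float:
        return phi["progress"] + shortcut_weight * phi["shortcut"]
    return reward


def low_quantile(values: Sequence[float], q: float) -> float:
    """Lower nearest-rank q-quantile; q=0.0 returns the worst case."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


def risk_averse_choice(actions: Dict[str, Features],
                       candidates: List[Reward],
                       q: float = 0.0) -> str:
    """Pick the action whose q-quantile value across candidate rewards is highest."""
    def score(name: str) -> float:
        return low_quantile([r(actions[name]) for r in candidates], q)
    return max(actions, key=score)


# Assumed scenario: the "shortcut" feature never appeared in the design
# distribution, so the proxy says nothing reliable about it.
actions = {
    "shortcut": {"progress": 1.0, "shortcut": 1.0},
    "standard": {"progress": 0.8, "shortcut": 0.0},
}
proxy = make_reward(0.0)  # the literal proxy ignores the novel feature
candidates = [make_reward(w) for w in (-2.0, 0.0, 1.0)]  # plausible true rewards

literal_choice = max(actions, key=lambda name: proxy(actions[name]))
cautious_choice = risk_averse_choice(actions, candidates, q=0.0)
print(literal_choice, cautious_choice)
```

In this toy setup the literal proxy prefers the shortcut, while the worst-case planner prefers the standard action because one plausible true reward penalises the shortcut heavily. Raising `q` towards 0.5 relaxes the conservatism, which is one way to tune the trade-off against lost proxy performance.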
Real-world Use Case
- The agent regularly encounters contexts outside the reward’s design distribution.
- Specification gaming or reward hacking in novel contexts is a real risk.
- Engineering capacity exists to construct a plausible-reward set or posterior.
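As a rough indication of what constructing a plausible-reward set can involve, the sketch below filters candidate linear rewards by consistency with the proxy on the design distribution. It is a crude hard-threshold stand-in for a proper posterior, offered only as an illustration: the two-feature layout (progress, shortcut), the candidate grid, and the helpers `value`, `best_action`, and `consistent_rewards` are assumptions, not part of any library.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Weights = Tuple[float, ...]
Context = Dict[str, Tuple[float, ...]]  # action name -> feature vector


def value(w: Weights, phi: Sequence[float]) -> float:
    """Linear reward: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def best_action(w: Weights, context: Context) -> str:
    """Action that maximises the linear reward w in this context."""
    return max(context, key=lambda a: value(w, context[a]))


def consistent_rewards(proxy: Weights,
                       design_contexts: List[Context],
                       candidates: List[Weights],
                       tolerance: float = 0.0) -> List[Weights]:
    """Keep candidates under which the proxy-optimal action is (near-)optimal
    in every design context, i.e. the proxy would have looked well designed."""
    kept = []
    for w in candidates:
        if all(
            value(w, ctx[best_action(proxy, ctx)])
            >= max(value(w, phi) for phi in ctx.values()) - tolerance
            for ctx in design_contexts
        ):
            kept.append(w)
    return kept


# Assumed design distribution: the second feature (shortcut) is always 0,
# so the design contexts cannot constrain its weight.
design_contexts = [
    {"a": (1.0, 0.0), "b": (0.5, 0.0)},
    {"a": (0.2, 0.0), "b": (0.9, 0.0)},
]
proxy = (1.0, 0.0)
candidates = list(product((-1.0, 1.0), (-2.0, 0.0, 1.0)))

plausible = consistent_rewards(proxy, design_contexts, candidates)
print(plausible)
```

Candidates that disagree with the proxy about progress are discarded, while every weight on the unseen shortcut feature survives. That residual ambiguity is what the risk-averse planner has to hedge against in a novel context.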
Source
Advantages
- Directly limits reward-hacking exposure in novel contexts.
- Composes with preference-uncertain agents naturally.
- Makes ‘distribution shift’ a planning-time consideration, not just a monitoring one.
Disadvantages
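- Conservatism sacrifices performance on the literal proxy, even in contexts where the proxy is still reliable.
- A set or posterior over true rewards is hard to construct honestly: too narrow a set gives false assurance, and too broad a set makes the agent overly cautious.
- Out-of-distribution detection is itself unreliable, so the pattern may activate too rarely or too often.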