Cooperative Preference Inference
Treat alignment as an ongoing two-player game — the agent maintains a reward posterior and updates it continuously from human demonstrations, corrections, and questions rather than relying on a fixed objective.
Intent & Description
🎯 Intent
Human preferences shift, are only partially observable, and were never fully written down. A static objective silently drifts out of alignment with them. This pattern treats alignment as an ongoing inference problem instead of a one-shot setup.
📋 Context
A long-running personal or organizational agent serves a human whose true preferences shift over time and were never specified completely. The agent observes demonstrations, corrections, partial instructions, and explicit questions — but has no closed-form objective function to optimize.
💡 Solution
Model the interaction as Cooperative Inverse Reinforcement Learning (CIRL). Human and agent share a single reward function R, but only the human knows it. The agent treats the human's actions, demonstrations, and corrections as evidence about R, maintains a posterior over R, and acts to maximize expected R under that posterior. Optimal play in this game gives rise to both active teaching (the human shows informative examples) and active learning (the agent asks targeted questions). This differs from RLHF as typically deployed, where the reward signal is learned offline and then held fixed: CIRL is continuous and online.
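The inference loop described above can be illustrated with a minimal sketch. It is a single-step Bayesian reward update over a tiny discrete hypothesis space, not a solver for the full CIRL game. The hypothesis names, the action names, and the Boltzmann-rational human model with its `beta` parameter are all assumptions made for illustration.

```python
import math

# Hypothetical candidate reward functions R, each mapping an action to a reward.
HYPOTHESES = {
    "prefers_brief": {"short_summary": 1.0, "long_report": 0.0, "bullet_list": 0.6},
    "prefers_detail": {"short_summary": 0.0, "long_report": 1.0, "bullet_list": 0.4},
    "prefers_structure": {"short_summary": 0.3, "long_report": 0.3, "bullet_list": 1.0},
}
ACTIONS = ["short_summary", "long_report", "bullet_list"]


def likelihood(action, rewards, beta=3.0):
    """P(human picks `action` | R), assuming a Boltzmann-rational human."""
    weights = {a: math.exp(beta * rewards[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())


def update(posterior, observed_action, beta=3.0):
    """One Bayes step: new P(R) is proportional to old P(R) * P(observation | R)."""
    unnorm = {
        h: p * likelihood(observed_action, HYPOTHESES[h], beta)
        for h, p in posterior.items()
    }
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


def best_action(posterior):
    """Action with the highest expected reward under the current posterior."""
    def expected(action):
        return sum(p * HYPOTHESES[h][action] for h, p in posterior.items())
    return max(ACTIONS, key=expected)


# Start from a uniform prior, then observe two demonstrations from the human.
posterior = {h: 1 / len(HYPOTHESES) for h in HYPOTHESES}
for demo in ["long_report", "long_report"]:
    posterior = update(posterior, demo)

print(posterior)
print(best_action(posterior))
```

Under the uniform prior the agent hedges toward the action that scores reasonably under every hypothesis. After the two demonstrations, most of the posterior mass sits on one hypothesis and the chosen action follows it.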
Real-world Use Case
- Long-running deployment where preferences shift and were never fully specified upfront.
- The agent has access to ongoing corrections, demonstrations, and questions as live signal.
- Building principled uncertainty into the agent’s objective is worth the engineering cost.
Advantages
- Alignment is treated as ongoing inference rather than a one-shot fine-tune
- Demonstrations, corrections, and questions all feed the same reward posterior as evidence
- Models a principled trade-off between asking and acting under uncertainty
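One way to make the ask-versus-act trade-off concrete is an expected-value-of-information rule, sketched below. It assumes that a question fully reveals the reward function, so the computed value is an upper bound on what a real question would yield. The `question_cost` parameter and the two toy hypotheses are hypothetical.

```python
def expected_reward(posterior, hypotheses, action):
    """Expected reward of `action`, averaged over the reward posterior."""
    return sum(p * hypotheses[h][action] for h, p in posterior.items())


def value_of_asking(posterior, hypotheses, actions):
    """Gain from learning R before acting, versus acting on the posterior now."""
    act_now = max(expected_reward(posterior, hypotheses, a) for a in actions)
    # If the answer revealed R exactly, the agent would pick the best action for it.
    after_answer = sum(
        p * max(hypotheses[h][a] for a in actions) for h, p in posterior.items()
    )
    return after_answer - act_now


def decide(posterior, hypotheses, actions, question_cost=0.1):
    """Ask only when the information is worth more than interrupting the human."""
    if value_of_asking(posterior, hypotheses, actions) > question_cost:
        return "ask"
    return "act"


# Hypothetical toy problem: two reward hypotheses that disagree completely.
hypotheses = {
    "A": {"x": 1.0, "y": 0.0},
    "B": {"x": 0.0, "y": 1.0},
}
actions = ["x", "y"]

print(decide({"A": 0.5, "B": 0.5}, hypotheses, actions))
print(decide({"A": 0.95, "B": 0.05}, hypotheses, actions))
```

With an even posterior, a wrong guess is costly and the rule prefers to ask. Once the posterior is concentrated, the remaining gain falls below the question cost and the rule acts.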
Disadvantages
- Closed-form CIRL solutions don’t scale to LLM-sized hypothesis spaces — LLM versions are approximations
- Requires the agent to maintain and update a reward posterior — heavy machinery for many products
- Misinterpreted human actions can push the posterior in damaging directions
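The last risk can be illustrated with a small sketch of how one misread action moves the posterior. It assumes a Boltzmann-rational human model in which `beta` encodes how deliberate the agent believes each human action to be. The scenario, the hypothesis names, and the idea of lowering `beta` as a damping measure are illustrative assumptions, not part of the pattern as stated.

```python
import math


def posterior_after(observed, prior, hypotheses, beta):
    """Bayes update on one observed action under a Boltzmann-rational human model."""
    unnorm = {}
    for h, p in prior.items():
        rewards = hypotheses[h]
        normalizer = sum(math.exp(beta * r) for r in rewards.values())
        unnorm[h] = p * math.exp(beta * rewards[observed]) / normalizer
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


# Hypothetical scenario: the agent is fairly sure the human values privacy.
HYPOTHESES = {
    "wants_privacy": {"share": 0.0, "withhold": 1.0},
    "wants_sharing": {"share": 1.0, "withhold": 0.0},
}
prior = {"wants_privacy": 0.9, "wants_sharing": 0.1}

# The human shares one file by accident, and the agent reads it as a preference.
overconfident = posterior_after("share", prior, HYPOTHESES, beta=10.0)
cautious = posterior_after("share", prior, HYPOTHESES, beta=1.0)

print(overconfident["wants_privacy"])
print(cautious["wants_privacy"])
```

When the agent assumes every action is near-perfectly deliberate (high `beta`), a single slip almost erases a strong prior. A model that allows for noisy actions (lower `beta`) shifts the posterior far less, at the price of learning more slowly from genuine signals.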