Self-RAG
Fine-tune the model to emit reflection tokens that decide when to retrieve, evaluate retrieved relevance, and assess generated support.
Intent & Description
🎯 Intent
Fine-tune the model to emit reflection tokens that decide when to retrieve, judge whether the retrieved passages are relevant, and assess whether the generated output is supported by them.
📋 Context
A team is building a retrieval-augmented system where retrieval is not always the right thing to do. Some queries are easy and can be answered from the model’s parametric knowledge; others genuinely require fresh evidence from the corpus. Even when retrieval happens, the chunks returned may not be relevant, and even when they are relevant, the final generation may not actually be supported by them. The team needs the model itself to reason about each of these decisions per request, instead of forcing every query through the same fixed pipeline.
💡 Solution
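A critic model is first trained to label data with reflection tokens. The generator is then fine-tuned on the labeled data so that it emits four reflection tokens inline at inference: [Retrieve] (is retrieval needed?), [IsRel] (is the retrieved evidence relevant?), [IsSup] (is the generation supported by the evidence?), and [IsUse] (is the generation useful?). The host, meaning the serving code around the model, enforces the reflection grammar and branches on the emitted tokens: it calls the retriever only when [Retrieve] asks for it, discards passages marked irrelevant, and prefers generations marked supported and useful.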
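The host-side control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published decoding procedure: the `[Name=value]` token spelling, the `answer_query` helper, the ranking weights, and the `fake_generate` / `fake_retrieve` stand-ins are all hypothetical. It also branches on hard token values, whereas a real system may weigh token probabilities instead.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Assumed spelling "[Name=value]"; a real fine-tuned model defines its own tokens.
TOKEN_RE = re.compile(r"\[(Retrieve|IsRel|IsSup|IsUse)=([A-Za-z0-9_]+)\]")
# Illustrative weights for ranking candidates, not values from any paper.
SUPPORT_SCORE = {"full": 2, "partial": 1, "none": 0}


@dataclass
class Candidate:
    passage: Optional[str]  # evidence used, or None if retrieval was skipped
    answer: str             # user-facing text with reflection tokens removed
    score: int              # higher is better


def parse_reflection(text: str) -> dict:
    """Collect reflection tokens from raw model output as {name: value}."""
    return dict(TOKEN_RE.findall(text))


def strip_reflection(text: str) -> str:
    """Drop reflection tokens and normalize whitespace."""
    return " ".join(TOKEN_RE.sub("", text).split())


def answer_query(
    query: str,
    generate: Callable[[str, Optional[str]], str],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 3,
) -> Optional[Candidate]:
    # Step 1: the model itself signals whether retrieval is needed.
    first = generate(query, None)
    if parse_reflection(first).get("Retrieve") != "yes":
        return Candidate(None, strip_reflection(first), 0)

    # Step 2: one generation per passage; skip passages judged irrelevant.
    best: Optional[Candidate] = None
    for passage in retrieve(query, top_k):
        raw = generate(query, passage)
        tokens = parse_reflection(raw)
        if tokens.get("IsRel") != "relevant":
            continue
        # Step 3: rank by support first, usefulness second.
        use = tokens.get("IsUse", "0")
        score = SUPPORT_SCORE.get(tokens.get("IsSup", "none"), 0) * 10
        score += int(use) if use.isdigit() else 0
        if best is None or score > best.score:
            best = Candidate(passage, strip_reflection(raw), score)
    return best  # None means no passage survived the relevance check


# --- Hypothetical stand-ins for the fine-tuned generator and the retriever ---
def fake_generate(query: str, passage: Optional[str]) -> str:
    if passage is None:
        if "capital" in query:
            return "[Retrieve=no] Paris is the capital of France. [IsUse=5]"
        return "[Retrieve=yes]"
    if "2023" in passage:
        return "[IsRel=relevant] Revenue grew 12% in 2023. [IsSup=full] [IsUse=4]"
    return "[IsRel=irrelevant] I am not sure. [IsSup=none] [IsUse=1]"


def fake_retrieve(query: str, k: int) -> List[str]:
    corpus = ["The office moved in 2019.", "The 2023 report shows revenue grew 12%."]
    return corpus[:k]


if __name__ == "__main__":
    print(answer_query("What is the capital of France?", fake_generate, fake_retrieve))
    print(answer_query("How did revenue change?", fake_generate, fake_retrieve))
```

With the stand-ins, the first query is answered without touching the retriever, while the second triggers retrieval and keeps only the candidate whose passage was marked relevant. Returning `None` when no passage survives leaves the fallback policy (retry, abstain, or answer from parametric knowledge) to the host.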
Real-world Use Case
- A retrieval-augmented generation system needs to decide per query when to retrieve and whether the returned evidence is relevant.
- A static retrieve-then-generate pipeline wastes retrieval calls on easy queries or lets unsupported claims through.
- Fine-tuning the generator on data labeled with reflection tokens is feasible.
Source
Advantages
- Adaptive retrieval: skip when not needed.
- Inline self-evaluation grounds generation.
Disadvantages
- Requires fine-tuning; not zero-shot.
- Reflection-token quality is bounded by the critic-labeled training data.