Context Window Packing
Allocate a fixed token budget across system prompt, history, retrieved chunks, and tools on every call — so the window never silently overflows.
Intent & Description
🎯 Intent
Choose what fits in the context window each turn given a fixed token budget.
📋 Context
The combined size of everything the model needs for the next call (system prompt, conversation history, retrieved chunks, tool definitions, current state) now exceeds the model’s maximum context window. Every call therefore requires an explicit decision about what goes in and what stays out.
💡 Solution
Define a packing policy. Reserve N tokens for the fixed parts: system prompt, tool definitions, and the response. Allocate the remainder across history (compressed), retrieved chunks (top-k after rerank), and current state. When a category exceeds its share, apply eviction (drop the oldest items), summarization (compress), or selection (keep the highest-ranked items by relevance). Audit token counts before each call.
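The policy above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `count_tokens` is a crude whitespace-based stand-in for a real tokenizer, and `Budget`, `evict_oldest`, `select_chunks`, `pack`, and the 40/40/20 split are hypothetical names and values chosen for the example. Summarization is left out because it requires a model call.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): counts words."""
    return len(text.split())


@dataclass
class Budget:
    window: int            # model's maximum context window, in tokens
    reserved: int          # N: system prompt + tool definitions + response
    history_pct: int = 40  # hypothetical share of the remainder for history
    chunk_pct: int = 40    # hypothetical share for retrieved chunks

    def __post_init__(self) -> None:
        if self.reserved >= self.window:
            raise ValueError("reserved tokens leave no room for dynamic context")

    @property
    def dynamic(self) -> int:
        """Tokens left after the fixed reservation."""
        return self.window - self.reserved


def evict_oldest(history: list[str], limit: int) -> list[str]:
    """Eviction: keep the newest messages that fit, drop the oldest."""
    kept: list[str] = []
    used = 0
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > limit:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


def select_chunks(ranked: list[str], limit: int) -> list[str]:
    """Selection: walk chunks in rank order, skipping any that do not fit."""
    kept: list[str] = []
    used = 0
    for chunk in ranked:
        cost = count_tokens(chunk)
        if used + cost <= limit:
            kept.append(chunk)
            used += cost
    return kept


def pack(budget: Budget, history: list[str], ranked_chunks: list[str], state: str) -> dict:
    """Split the dynamic budget and apply one policy per category."""
    history_limit = budget.dynamic * budget.history_pct // 100
    chunk_limit = budget.dynamic * budget.chunk_pct // 100
    state_limit = budget.dynamic - history_limit - chunk_limit
    if count_tokens(state) > state_limit:
        raise ValueError("current state exceeds its allocation")
    return {
        "history": evict_oldest(history, history_limit),
        "chunks": select_chunks(ranked_chunks, chunk_limit),
        "state": state,
    }
```

Calling `pack(Budget(window=100, reserved=50), history, ranked_chunks, state)` leaves 50 dynamic tokens: 20 for history, 20 for chunks, and 10 for current state. In practice the shares would be tuned per application, and `count_tokens` would be replaced by the tokenizer that matches the target model.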
Real-world Use Case
- Naive concatenation overflows the context window for realistic inputs.
- Some context (system prompt, tools, response reservation) is fixed and the rest must be allocated dynamically.
- Token counts can be audited before each call and the policy can be adjusted.
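The pre-call audit mentioned in the last bullet might look like the sketch below. It assumes the same crude word-count stand-in for a tokenizer; `audit` and `ContextOverflow` are hypothetical names introduced for illustration. The function returns per-part token counts so the trade-offs stay inspectable, and it fails loudly instead of letting the call overflow.

```python
class ContextOverflow(Exception):
    """Raised when the assembled context would not fit in the window."""


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): counts words."""
    return len(text.split())


def audit(parts: dict[str, str], window: int, response_reserve: int) -> dict[str, int]:
    """Return token counts per part; raise if prompt plus response would overflow."""
    counts = {name: count_tokens(text) for name, text in parts.items()}
    total = sum(counts.values()) + response_reserve
    if total > window:
        largest = max(counts, key=counts.get)
        raise ContextOverflow(
            f"{total} tokens needed, window is {window}; largest part: {largest}"
        )
    return counts
```

Logging the returned counts on every call gives a record of what was included, which makes it easier to adjust the policy when one category keeps crowding out the others.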
📌 TL;DR
Define a token budget and enforce it explicitly on every call — predictable window behavior beats silent overflow every time.
Advantages
- Predictable, deterministic behavior at the window edge — no surprise truncation.
- Inspectable trade-offs — you can see exactly what got included and why.
Disadvantages
- Packing logic adds implementation complexity that grows with the number of context sources.
- Summarization can drop or distort details, degrading coherence in ways that are hard to detect.