Sleep-Time Compute
During idle periods, pre-compute dense summaries and likely future answers against the user's standing context — so test-time latency and cost drop dramatically on cache hits.
Intent & Description
🎯 Intent
During idle or downtime, run the model offline against the user’s standing context to pre-compute dense summaries and likely future answers — so test-time latency and cost drop when the user actually asks.
📋 Context
You’re running an agent over persistent user context — a codebase, document set, prior session transcripts — that users query repeatedly. Many queries are predictable variants of previous ones, and the corpus doesn’t change between most of them. Idle capacity exists between sessions when no one is waiting for an answer.
💡 Solution
The pattern uses two types of offline pass. (1) Distillation: compress the corpus into structured summaries (per-file, per-module, per-topic) that capture what queries are likely to need. (2) Speculative pre-answering: predict likely next queries from query history, recent context, and structural signals; generate answers ahead of time; and store them keyed by query embedding. At test time, check the speculative cache first. On a hit, return or lightly adapt the pre-answer. On a miss, fall back to live inference and add the new query to the prediction set. Invalidate pre-computed material whenever its source documents change.
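The test-time flow described above (check the cache, fall back on a miss, invalidate on source change) can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SpeculativeCache`, `PreAnswer`, `answer`, and the 0.8 similarity threshold are all hypothetical names and values chosen for the example.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class PreAnswer:
    query: str
    answer: str
    sources: frozenset  # ids of the documents the answer was derived from


@dataclass
class SpeculativeCache:
    threshold: float = 0.8  # assumed similarity cutoff for counting as a hit
    entries: list = field(default_factory=list)
    missed_queries: list = field(default_factory=list)  # input to the next offline pass

    def store(self, query, answer, sources):
        """Called by the offline pass: keep a pre-answer keyed by its query embedding."""
        self.entries.append((embed(query), PreAnswer(query, answer, frozenset(sources))))

    def lookup(self, query):
        """Return the closest pre-answer if it clears the threshold, else None."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def invalidate(self, doc_id):
        """Drop every pre-answer derived from a changed document; return the count."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if doc_id not in e[1].sources]
        return before - len(self.entries)


def answer(cache, query, live_inference):
    """Serve from the cache on a hit; otherwise run live inference and log the miss."""
    hit = cache.lookup(query)
    if hit is not None:
        return hit.answer, "hit"
    cache.missed_queries.append(query)
    return live_inference(query), "miss"
```

In use, an idle-time job would call `store` for each predicted query, the serving path would call `answer`, and a file watcher or commit hook would call `invalidate` with the id of any changed document. A production version would likely replace the linear scan with a vector index and tune the threshold against observed answer quality.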
Real-world Use Case
- The agent operates over standing context that changes slowly relative to query volume.
- Idle capacity exists between sessions, while test-time inference runs at peak cost.
- User queries against the corpus are repetitive or predictable from history.
- Test-time latency matters more than offline compute cost.
Source
📌 TL;DR
Compute summaries and speculative answers during idle time, serve them at test time — shift cost from peak-latency inference to cheap idle compute and make your agent feel instant on common queries.
Advantages
- Test-time latency drops dramatically on cache hits — the answer is already computed.
- Cost shifts from peak (test-time) to trough (idle) capacity pricing.
- Distilled summaries also speed up cold queries by serving as compact retrieval targets.
- Speculative coverage improves over time as the prediction model learns from misses.
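One simple way to let the prediction set learn from misses is a frequency count over past queries in which recent misses carry extra weight. The sketch below assumes exactly that heuristic; `predict_next_queries`, `normalize`, and the `miss_weight` value are hypothetical, and a real system might use a learned predictor instead.

```python
from collections import Counter


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivial variants count as one query."""
    return " ".join(query.lower().split())


def predict_next_queries(history, misses, k=3, miss_weight=2):
    """Rank candidate queries for the next offline pass.

    history: queries seen so far (hits and misses alike).
    misses:  queries that fell through to live inference; each adds
             `miss_weight` extra votes (an assumed heuristic) so that
             uncovered queries are prioritized for pre-answering.
    """
    counts = Counter(normalize(q) for q in history)
    for q in misses:
        counts[normalize(q)] += miss_weight
    return [q for q, _ in counts.most_common(k)]
```

The returned list is what the idle-time job would pre-answer next, so a query that missed once becomes more likely to hit the next time it is asked.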
Disadvantages
- Offline compute is a real cost: predictions that never get asked are wasted spend.
- Stale pre-answers can mislead if invalidation lags corpus changes.
- Privacy implication: pre-answering means the system holds and reasons over user data during idle periods.
- Quality can regress if speculative pre-answers are generated with less effort than live inference and the agent doesn’t detect the gap.
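The wasted-spend trade-off can be made concrete with a rough break-even calculation. The sketch below is a simplification under stated assumptions: it considers compute cost only (not the value of lower latency), treats a cache hit as free, and uses hypothetical function names and illustrative prices.

```python
def expected_net_saving(n_predictions, offline_cost, queries, hit_rate, live_cost):
    """Live-inference cost avoided on hits, minus what the offline pass spent."""
    spend = n_predictions * offline_cost
    avoided = queries * hit_rate * live_cost
    return avoided - spend


def break_even_hit_rate(n_predictions, offline_cost, queries, live_cost):
    """Smallest hit rate at which the offline pass pays for itself."""
    return (n_predictions * offline_cost) / (queries * live_cost)


# Illustrative numbers only: 200 pre-answers at 0.002 each during idle time,
# 1000 test-time queries that would each cost 0.01 to answer live.
saving = expected_net_saving(200, 0.002, queries=1000, hit_rate=0.3, live_cost=0.01)
needed = break_even_hit_rate(200, 0.002, queries=1000, live_cost=0.01)
print(f"net saving: {saving:.2f}, break-even hit rate: {needed:.0%}")
```

If the observed hit rate sits below the break-even value, the prediction set is too broad or too inaccurate for the pass to justify its cost on compute grounds alone.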