Cluster-Capped Insight Store
Cap insights per stem-token cluster and archive the oldest near-duplicates so the active store holds the current research edge — not a graveyard of variants.
Intent & Description
🎯 Intent
Without eviction, an append-only insight store accumulates near-duplicate notes on the same topic until retrieval noise drowns out the signal.
📋 Context
A long-lived agent writes small insight notes continuously over weeks. On recurring topics it produces slightly different versions of the same note rather than locating and updating the old one. The store fills with clusters of near-duplicates; older genuine insights become invisible.
💡 Solution
A periodic consolidation job scans the insight directory and groups files by the first two stem tokens of their ID (e.g. affect-substrate-, completion-narration-). For any cluster holding more than MAX_PER_CLUSTER files, it keeps only the newest MAX_PER_CLUSTER by mtime. Older files move to archive/insights-dedup-
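The job described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: the function names, the cap value of 3, and the archive_dir parameter are all hypothetical. The source's archive path is truncated, so the caller supplies the destination here.

```python
import shutil
from collections import defaultdict
from pathlib import Path

MAX_PER_CLUSTER = 3  # assumed value; the source does not state the actual cap


def cluster_key(path: Path) -> str:
    """First two hyphen-separated tokens of the file stem, e.g. 'affect-substrate'."""
    return "-".join(path.stem.split("-")[:2])


def consolidate(insight_dir: Path, archive_dir: Path,
                max_per_cluster: int = MAX_PER_CLUSTER) -> list[Path]:
    """Move all but the newest `max_per_cluster` files of each cluster to archive_dir.

    Returns the archived paths so the run can be logged and audited.
    """
    clusters = defaultdict(list)
    for path in insight_dir.iterdir():
        if path.is_file():  # skip subdirectories, including a nested archive
            clusters[cluster_key(path)].append(path)

    archived = []
    for files in clusters.values():
        # Newest first by modification time.
        files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
        for old in files[max_per_cluster:]:
            archive_dir.mkdir(parents=True, exist_ok=True)
            # Assumes no file of the same name already exists in the archive.
            dest = archive_dir / old.name
            shutil.move(str(old), str(dest))
            archived.append(dest)
    return archived
```

Returning the list of moved files keeps the run auditable: a scheduler can log exactly which variants left the active store on each pass.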
Real-world Use Case
- Insights are written continuously and near-duplicates accumulate on recurring topics.
- An LLM-merge approach is too expensive or too opaque for the use case.
- Stem-token clustering is a reasonable proxy for topical similarity in the corpus.
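One way to judge whether stem-token clustering fits a given corpus is a read-only audit that lists the clusters exceeding the cap, so a person can check that their members really are variants of one topic. The sketch below assumes insight IDs are hyphen-separated strings; over_cap_clusters is a hypothetical helper name.

```python
from collections import defaultdict


def over_cap_clusters(insight_ids, cap):
    """Return {cluster_key: sorted member IDs} for clusters larger than `cap`.

    Nothing is moved or deleted; this only reports what a consolidation
    pass would consider trimming.
    """
    groups = defaultdict(list)
    for insight_id in insight_ids:
        key = "-".join(insight_id.split("-")[:2])  # first two stem tokens
        groups[key].append(insight_id)
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > cap}
```

If the reported clusters mix clearly unrelated notes, the proxy is probably too coarse for the corpus.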
Advantages
- Active store stays current — the research edge, not a variant graveyard
- Mechanical clustering has no model cost and is fully auditable
- Archive preserves older variants for forensics when needed
Disadvantages
- Stem-token clustering will sometimes split related insights or merge unrelated ones
- The cap is opinionated — bad cluster boundaries lose useful older work
- Total storage still grows: archived variants are kept, only moved out of the active store
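The two clustering failure modes listed above are easy to reproduce. The IDs below are invented for illustration and the helper names are hypothetical. The first pair covers one topic but differs in its first token, so it splits; the second pair covers different topics but shares two tokens, so it merges and competes for the same cap.

```python
def cluster_key(insight_id: str) -> str:
    """First two hyphen-separated tokens of an insight ID."""
    return "-".join(insight_id.split("-")[:2])


def same_cluster(a: str, b: str) -> bool:
    return cluster_key(a) == cluster_key(b)


# Hypothetical IDs: related notes whose first token differs slightly.
split_pair = ("affect-substrate-drift", "affective-substrate-drift")

# Hypothetical IDs: unrelated notes that share their first two tokens.
merged_pair = ("completion-narration-latency", "completion-narration-tone")

print(same_cluster(*split_pair))   # False: related notes land in separate clusters
print(same_cluster(*merged_pair))  # True: unrelated notes share one cap
```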