Self-Corpus Vocabulary
Mine the agent's own writing for a small cached vocabulary of its most active concepts — so relevance scoring reflects the agent's own frame, not just generic embedding distance.
Intent & Description
🎯 Intent
Mine a small bounded vocabulary from the agent’s own writing and cache it as the conceptual axis for scoring new thoughts — so relevance reflects the agent’s actual frame rather than a generic embedding space.
📋 Context
A long-running agent accumulates a corpus of its own output: thought traces, insights, journal entries, notes. Downstream components want to score new thoughts for relevance, novelty, or kinship with existing concerns. Generic embedding similarity answers “is this semantically close?” but not “is the agent still pulling at the things it’s been pulling at?” — a meaningfully different question.
💡 Solution
Run a periodic mining pass over the agent’s own corpus: the last few weeks of thoughts plus the long-term insight store. Aggregate frontmatter tags and content-token frequency to extract the top-N concept tokens, each with a weight, and persist the result as a small JSON cache. Downstream scoring adds this as an additional axis: a thought is scored on both generic embedding similarity to recent context and overlap with the cached self-vocabulary. Set the refresh cadence in proportion to corpus volatility (e.g. weekly for a stable agent, after every consolidation cycle for a volatile one).
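The mining pass could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the document shape (`tags` and `text` keys), the `tag_weight` parameter, the tiny stopword list, and the function names are all assumptions introduced here for the example.

```python
import json
import re
from collections import Counter
from pathlib import Path

# Hypothetical, deliberately tiny stopword list; a real deployment would use a fuller one.
STOPWORDS = {"the", "and", "of", "to", "in", "is", "it", "that", "for", "on", "with"}


def mine_vocabulary(docs, top_n=20, tag_weight=3.0):
    """Return {token: weight} for the top_n concepts in the corpus.

    docs is assumed to be a list of dicts with optional "tags" (list of str)
    and "text" (str). Each tag occurrence counts tag_weight, each content-token
    occurrence counts 1.0; weights are normalised so the top token has 1.0.
    """
    scores = Counter()
    for doc in docs:
        for tag in doc.get("tags", []):
            scores[tag.lower()] += tag_weight
        for token in re.findall(r"[a-z][a-z\-]+", doc.get("text", "").lower()):
            if token not in STOPWORDS:
                scores[token] += 1.0
    top = scores.most_common(top_n)
    if not top:
        return {}
    max_score = top[0][1]
    return {token: round(score / max_score, 3) for token, score in top}


def save_vocabulary(vocab, path):
    """Persist the vocabulary as a small JSON cache."""
    Path(path).write_text(json.dumps(vocab, indent=2, sort_keys=True))


def load_vocabulary(path):
    """Load a previously cached vocabulary."""
    return json.loads(Path(path).read_text())
```

The `tag_weight` knob is where the tag-versus-frequency trade-off lives: raising it trusts explicit frontmatter tags more, lowering it lets raw content frequency dominate.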
Real-world Use Case
- The agent has a corpus of its own writing large enough to mine (weeks of accumulated thoughts).
- Downstream scoring needs an own-frame axis beyond generic semantic similarity.
- Refresh cadence is feasible on the deployment’s compute budget.
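One way to picture the own-frame axis from the second item is as a blend of two scores. The sketch below assumes embeddings are plain lists of floats and that the thought has already been tokenised; `alpha`, `vocab_overlap`, and `score_thought` are hypothetical names chosen for illustration, and the linear blend is just one possible way to combine the axes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vocab_overlap(tokens, vocab):
    """Fraction of total vocabulary weight covered by the thought's tokens."""
    total = sum(vocab.values())
    if total == 0:
        return 0.0
    present = set(tokens)
    return sum(weight for token, weight in vocab.items() if token in present) / total


def score_thought(thought_vec, context_vec, thought_tokens, vocab, alpha=0.5):
    """Blend generic similarity with own-frame overlap.

    alpha=1.0 ignores the self-vocabulary; alpha=0.0 uses it exclusively.
    """
    generic = cosine(thought_vec, context_vec)
    own_frame = vocab_overlap(thought_tokens, vocab)
    return alpha * generic + (1 - alpha) * own_frame
```

Keeping the two components separate until the final blend also makes it easy to log them individually, which helps when tuning `alpha`.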
📌 TL;DR
Mine your agent’s own writing for a cached concept vocabulary and use it as an additional scoring axis alongside embedding similarity: for self-aware memory, relevance to “what this agent thinks about” captures something that generic embedding distance alone misses.
Advantages
- Relevance scoring becomes sensitive to the agent’s own conceptual frame, not just generic embedding space.
- Vocabulary changes are visible and auditable — operators can see what the agent is currently “about.”
- Small footprint (top-N tokens) is cheap to load and use in scoring.
Disadvantages
- Frame lock-in — a stale vocabulary reinforces what the agent already knows at the expense of new directions.
- Mining is opinionated; tag-vs-frequency weighting is a tuning decision.
- If the corpus is too small, the extracted vocabulary is noisy and unreliable.
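Two of these risks can be partly guarded against with simple checks at refresh time. The sketch below is an assumption-laden illustration: `min_docs`, `min_tokens`, and both function names are hypothetical, and the thresholds would need tuning per deployment. It skips mining when the corpus is too small, and it reports how much the vocabulary changed between refreshes so that a vocabulary that never moves can be flagged for review.

```python
def corpus_is_minable(docs, min_docs=50, min_tokens=5000):
    """Return True only if the corpus looks large enough to mine reliably.

    docs is assumed to be a list of dicts with an optional "text" key.
    Token count is approximated by whitespace splitting.
    """
    if len(docs) < min_docs:
        return False
    token_count = sum(len(doc.get("text", "").split()) for doc in docs)
    return token_count >= min_tokens


def vocabulary_drift(old_vocab, new_vocab):
    """Return 1 - Jaccard overlap of the two token sets.

    0.0 means the token set did not change; 1.0 means it was fully replaced.
    Weights are ignored here; only membership is compared.
    """
    old_tokens, new_tokens = set(old_vocab), set(new_vocab)
    union = old_tokens | new_tokens
    if not union:
        return 0.0
    return 1.0 - len(old_tokens & new_tokens) / len(union)
```

A drift that stays near zero across many refreshes, while the corpus keeps growing, is one possible signal of frame lock-in worth surfacing to operators.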