Self-Corpus Vocabulary

Agentic AI Memory

Mine the agent's own writing for a small cached vocabulary of its most active concepts — so relevance scoring reflects the agent's own frame, not just generic embedding distance.

Self-Corpus Vocabulary gives the agent a self-awareness axis for memory scoring: a periodic mining pass over the agent’s own corpus (thought traces, insights, journal entries) extracts the top-N concept tokens with weights and caches them as a small JSON file — downstream scoring components use this vocabulary as an additional axis alongside generic embedding similarity, answering “is this relevant to what this agent is preoccupied with?” rather than just “is this semantically close?”

Intent & Description

🎯 Intent

Mine a small bounded vocabulary from the agent’s own writing and cache it as the conceptual axis for scoring new thoughts — so relevance reflects the agent’s actual frame rather than a generic embedding space.

📋 Context

A long-running agent accumulates a corpus of its own output: thought traces, insights, journal entries, notes. Downstream components want to score new thoughts for relevance, novelty, or kinship with existing concerns. Generic embedding similarity answers “is this semantically close?” but not “is the agent still pulling at the things it’s been pulling at?” — a meaningfully different question.

💡 Solution

Run a periodic mining pass over the agent’s own corpus (last N weeks of thoughts + long-term insight store). Aggregate frontmatter tags and content frequency to extract the top-N concept tokens with weights. Persist as a small JSON cache. Downstream scoring adds this as an additional axis: a thought is scored on both generic embedding similarity to recent context and overlap with the cached self-vocabulary. Refresh cadence should be proportional to corpus volatility (e.g. weekly for a stable agent, after every consolidation cycle for a volatile one).

Real-world Use Case

The agent has an own-writing corpus large enough to mine (weeks of accumulated thoughts).
Downstream scoring needs an own-frame axis beyond generic semantic similarity.
Refresh cadence is feasible on the deployment’s compute budget.

Source

View Original Source →

📌 TL;DR

Mine your agent’s own writing for a cached concept vocabulary and use it as an additional scoring axis — relevance to “what this agent thinks about” beats generic embedding distance for self-aware memory.

Advantages

Relevance scoring becomes sensitive to the agent’s own conceptual frame, not just generic embedding space.
Vocabulary changes are visible and auditable — operators can see what the agent is currently “about.”
Small footprint (top-N tokens) is cheap to load and use in scoring.

Disadvantages

Frame lock-in — a stale vocabulary reinforces what the agent already knows at the expense of new directions.
Mining is opinionated; tag-vs-frequency weighting is a tuning decision.
If the corpus is too small, the extracted vocabulary is noisy and unreliable.

97 of 329