CDC-Driven Vector Sync
Intent & Description
🎯 Intent
Treat the source-of-truth document store as the only writer; keep the vector index in sync by emitting change-data-capture events onto a queue that the feature pipeline consumes.
📋 Context
A RAG system reads from a vector index built over a corpus that lives in a source-of-truth store (database, document system, content platform). The corpus changes continuously — inserts, updates, deletes. The vector index must stay in sync or retrieval returns stale or missing material.
💡 Solution
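Enable change-data-capture on the source-of-truth store, for example MongoDB change streams, or PostgreSQL logical replication read by Debezium running on Kafka Connect. Publish each change as an event to a queue (Kafka, RabbitMQ, SNS). The feature pipeline subscribes: on insert, embed and upsert; on update, re-embed and overwrite; on delete, remove from the vector index. Handlers should be idempotent, since most queues deliver at least once and an event may be processed twice. The writer code knows nothing about embeddings. The pipeline can be paused and redeployed without touching the writer, and backfilled from queue history when the queue retains events (a Kafka topic with sufficient retention, for instance).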
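The insert/update/delete handling described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `ChangeEvent` shape, the hash-based `embed` function, and `InMemoryVectorIndex` are hypothetical stand-ins for a real CDC payload, embedding model, and vector database client.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChangeEvent:
    """Hypothetical CDC event shape; real payloads differ per CDC tool."""
    op: str                     # "insert", "update", or "delete"
    doc_id: str
    text: Optional[str] = None  # absent on delete


def embed(text: str, dim: int = 8) -> List[float]:
    """Stand-in for an embedding model: a deterministic hash-based vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


class InMemoryVectorIndex:
    """Stand-in for a vector store client offering upsert and delete by id."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}

    def upsert(self, doc_id: str, vector: List[float]) -> None:
        self.vectors[doc_id] = vector

    def delete(self, doc_id: str) -> None:
        # Deleting a missing id is a no-op, so a repeated delete is harmless.
        self.vectors.pop(doc_id, None)


def apply_event(index: InMemoryVectorIndex, event: ChangeEvent) -> None:
    """Route one change event to the matching vector index operation."""
    if event.op in ("insert", "update"):
        if event.text is None:
            raise ValueError(f"{event.op} event for {event.doc_id} has no text")
        # Insert and update both end in an upsert keyed by document id.
        index.upsert(event.doc_id, embed(event.text))
    elif event.op == "delete":
        index.delete(event.doc_id)
    else:
        raise ValueError(f"unknown op: {event.op}")


# Simulated queue contents, consumed in order.
index = InMemoryVectorIndex()
events = [
    ChangeEvent("insert", "a", "hello"),
    ChangeEvent("insert", "b", "world"),
    ChangeEvent("update", "a", "hello again"),
    ChangeEvent("delete", "b"),
]
for event in events:
    apply_event(index, event)
```

Because both insert and update resolve to an upsert keyed by document id, and delete tolerates a missing id, processing the same event twice leaves the index in the same state.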
Real-world Use Case
- Vector index must reflect a corpus that changes continuously.
- Source-of-truth store supports CDC (change streams, logical replication, or a Debezium connector).
- Eventual consistency on retrieval (seconds-to-minutes lag) is acceptable.
Advantages
- Single writer to the source; embeddings follow as an asynchronous derived view.
- Vector index drift bounded by queue lag, not by rebuild cadence.
- Feature pipeline is independently scalable, debuggable, and replayable.
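Replaying the queue is only safe if applying an old or duplicated event cannot overwrite newer state. One way to get that property, sketched here under the assumption that each event carries a per-document version that increases with every source write (the `version` field and the `apply_versioned` helper are hypothetical), is to skip any event that is not newer than the last one applied:

```python
from typing import Dict, List, Optional, Tuple

# doc_id -> (last applied version, stored vector, or None once deleted)
State = Dict[str, Tuple[int, Optional[List[float]]]]


def apply_versioned(
    state: State, doc_id: str, version: int, vector: Optional[List[float]]
) -> bool:
    """Apply a change only if it is newer than what was already applied.

    A vector of None records a delete as a tombstone, so a late-arriving
    older upsert cannot resurrect the document. Returns True if applied.
    """
    current = state.get(doc_id)
    if current is not None and current[0] >= version:
        return False  # duplicate or stale event: ignore
    state[doc_id] = (version, vector)
    return True


state: State = {}
first = apply_versioned(state, "a", 1, [0.1])      # new document
duplicate = apply_versioned(state, "a", 1, [0.1])  # same event replayed
deleted = apply_versioned(state, "a", 3, None)     # delete at version 3
stale = apply_versioned(state, "a", 2, [0.2])      # older update arrives late
```

The tombstone costs some storage, but without it a replayed upsert that predates a delete would put the document back in the index.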
Disadvantages
- CDC infrastructure to operate (Debezium, Kafka Connect, change streams).
- Retrieval is eventually consistent: until the pipeline has processed an event, queries can return stale content or miss newly written documents.
- Schema changes on the source need coordinated migrations in the embedding pipeline.
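One way to keep schema changes manageable is to make the pipeline select its text extraction by an explicit schema version carried on each document, so that old and new shapes can coexist in the queue during a migration. The sketch below assumes a `schema_version` field and the field names `body`, `title`, and `content`; all of these are hypothetical.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]


def text_v1(doc: Doc) -> str:
    # Hypothetical original schema: a single "body" field.
    return doc["body"]


def text_v2(doc: Doc) -> str:
    # Hypothetical migrated schema: "body" split into "title" and "content".
    return f'{doc["title"]}\n\n{doc["content"]}'


EXTRACTORS: Dict[int, Callable[[Doc], str]] = {1: text_v1, 2: text_v2}


def text_to_embed(doc: Doc) -> str:
    """Pick the extractor that matches the document's schema version."""
    version = doc.get("schema_version", 1)  # assume unversioned docs are v1
    try:
        extractor = EXTRACTORS[version]
    except KeyError:
        # Fail loudly rather than embed the wrong fields.
        raise ValueError(f"no extractor for schema_version={version}") from None
    return extractor(doc)


old_text = text_to_embed({"body": "plain text"})
new_text = text_to_embed(
    {"schema_version": 2, "title": "Title", "content": "Body text"}
)
```

Raising on an unknown version stops the pipeline from silently embedding the wrong fields when the source migrates before the pipeline has been updated.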