Streaming Feature Pipeline
Process raw documents into RAG features as a continuous stream rather than a batch job, with typed models pinning each stage.
Intent & Description
🎯 Intent
Process raw documents into RAG features as a continuous stream rather than a batch job, with a typed model defining the input and output of each stage.
📋 Context
An LLM application’s vector index must stay close to the live state of an evolving corpus. Batch rebuilds run only every N hours, so the index can lag the source by hours. The team wants the pipeline to consume change events as they happen and update the index as each event is processed.
💡 Solution
Use a streaming framework (Bytewax, Flink, Kafka Streams) to consume change events. Define a Pydantic (or equivalent) model per stage: RawDocument → CleanedDocument → ChunkedDocument → EmbeddedDocument. Each stage is a map operation that takes one model and emits the next, so a malformed record fails validation at the stage boundary where it first appears rather than somewhere downstream. Events that fail are routed to a dead-letter queue (DLQ) for inspection instead of blocking the stream. Each EmbeddedDocument is upserted into the vector index as it leaves the last stage.
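The stage chain described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it uses standard-library dataclasses with `__post_init__` checks as the "equivalent" of Pydantic models so that it runs without dependencies, the field names (`doc_id`, `body`, `text`, `chunks`, `vectors`) are assumptions, the embedding is a placeholder rather than a real model, and a plain dict stands in for the vector index. In a real deployment the `for` loop in `run_stream` would be replaced by the streaming framework's map operators.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class RawDocument:
    doc_id: str
    body: str

    def __post_init__(self):
        if not self.doc_id:
            raise ValueError("doc_id must be non-empty")


@dataclass(frozen=True)
class CleanedDocument:
    doc_id: str
    text: str

    def __post_init__(self):
        if not self.text:
            raise ValueError("cleaned text must be non-empty")


@dataclass(frozen=True)
class ChunkedDocument:
    doc_id: str
    chunks: Tuple[str, ...]


@dataclass(frozen=True)
class EmbeddedDocument:
    doc_id: str
    chunks: Tuple[str, ...]
    vectors: Tuple[Tuple[float, ...], ...]

    def __post_init__(self):
        if len(self.chunks) != len(self.vectors):
            raise ValueError("exactly one vector per chunk is required")


def clean(raw: RawDocument) -> CleanedDocument:
    # Collapse runs of whitespace; an all-whitespace body fails validation here.
    return CleanedDocument(raw.doc_id, " ".join(raw.body.split()))


def chunk(doc: CleanedDocument, size: int = 5) -> ChunkedDocument:
    # Fixed-size word windows (an assumed, deliberately simple chunking rule).
    words = doc.text.split()
    parts = tuple(" ".join(words[i:i + size]) for i in range(0, len(words), size))
    return ChunkedDocument(doc.doc_id, parts)


def embed(doc: ChunkedDocument) -> EmbeddedDocument:
    # Placeholder embedding (character count, vowel count), NOT a real model.
    vectors = tuple(
        (float(len(c)), float(sum(ch in "aeiou" for ch in c))) for c in doc.chunks
    )
    return EmbeddedDocument(doc.doc_id, doc.chunks, vectors)


def run_stream(events: Iterable[RawDocument], index: Dict) -> None:
    # Each event flows through every stage, then is upserted chunk by chunk.
    for raw in events:
        embedded = embed(chunk(clean(raw)))
        for i, (text, vec) in enumerate(zip(embedded.chunks, embedded.vectors)):
            index[(embedded.doc_id, i)] = (text, vec)
```

Because each function accepts exactly one model and returns the next, a record that violates a stage's contract raises at that stage's constructor, which is the boundary the pattern relies on.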
Real-world Use Case
- Real-time RAG ingest is needed and batch lag is unacceptable.
- Source events can be modelled as a stream (CDC, webhook, queue).
- Engineering capacity to operate a streaming framework exists.
Advantages
- Vector index lag is bounded by the stream's processing latency and throughput, not by batch cadence.
- Typed stage transitions surface shape drift immediately.
- Failed events are isolated in the DLQ; the stream continues.
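The DLQ isolation listed above can be illustrated with a small, framework-free sketch. Everything here is hypothetical: the `DeadLetter` record, the `process` helper, and the two example stages are made up for illustration, and a Python list stands in for a real dead-letter topic or queue. The point is only that a failing event is recorded with the stage name and error, and the loop moves on to the next event.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class DeadLetter:
    event: Any   # the original event, kept intact for inspection or replay
    stage: str   # name of the stage that rejected it
    error: str   # repr of the exception that was raised


def process(
    events: Iterable[Any],
    stages: List[Tuple[str, Callable[[Any], Any]]],
    sink: Callable[[Any], None],
    dlq: List[DeadLetter],
) -> None:
    for event in events:
        value = event
        for name, fn in stages:
            try:
                value = fn(value)
            except Exception as exc:  # broad on purpose here; narrow it in practice
                dlq.append(DeadLetter(event, name, repr(exc)))
                break
        else:
            # Runs only when no stage failed for this event.
            sink(value)


def clean(event: dict) -> str:
    text = " ".join(event["body"].split())  # KeyError if "body" is missing
    if not text:
        raise ValueError("empty document")
    return text


def chunk(text: str) -> List[str]:
    return text.split(". ")
```

For example, feeding `[{"body": "a b"}, {"id": 2}, {"body": "c"}]` through `[("clean", clean), ("chunk", chunk)]` delivers the first and third events to the sink and leaves one `DeadLetter` tagged with stage `"clean"`.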
Disadvantages
- Adds a streaming framework (Bytewax, Flink, etc.) to deploy and operate.
- Per-stage type models add boilerplate.
- Backfilling the historical corpus needs a separate pipeline or a replay strategy.
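One possible replay strategy for the backfill point above is to push historical documents through the same code path as live events and make the upsert idempotent and version-aware, so a replayed old record never overwrites a newer live one. The sketch below is an assumption-laden illustration: the `(doc_id, version, body)` event shape, the `featurize` placeholder (standing in for clean, chunk, and embed), and the dict used as the index are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (doc_id, version, body).
Event = Tuple[str, int, str]
Index = Dict[str, Tuple[int, List[float]]]


def featurize(body: str) -> List[float]:
    # Placeholder for the real clean -> chunk -> embed stages.
    return [float(len(body.split()))]


def upsert(index: Index, event: Event) -> bool:
    doc_id, version, body = event
    current = index.get(doc_id)
    if current is not None and current[0] >= version:
        return False  # stale or duplicate event: leave the index untouched
    index[doc_id] = (version, featurize(body))
    return True


def replay(index: Index, events: Iterable[Event]) -> int:
    # Returns how many events actually changed the index.
    return sum(upsert(index, e) for e in events)
```

With this guard, the live stream and a historical replay can run in either order, or be re-run after a failure, without the index regressing to older content.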