Crawler Dispatcher
Route each incoming URL to a domain-specific crawler via a central dispatcher — adding a new source is just registering a class.
Intent & Description
🎯 Intent
Keep per-source crawling logic isolated so sources evolve and are tested independently.
📋 Context
Your LLM pipeline ingests from LinkedIn, Medium, GitHub, Substack, and internal sites. Each has its own auth, pagination, rate limits, and quirks. Without structure, the ingestion code collapses into a single if-else chain that branches on the source, and every new source means editing that chain.
💡 Solution
Define a Crawler interface with a single method, fetch(url) → document, and implement one crawler class per source. A Dispatcher holds a registry that maps URL patterns to crawler classes; dispatcher.get_crawler(url) returns an instance of the class whose pattern matches the URL. Adding a new source is one call: dispatcher.register(pattern, CrawlerClass). The dispatcher stays small and stable while the crawlers evolve independently.
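A minimal Python sketch of this structure, under stated assumptions: the Document fields, the regex-based matching, the first-match-wins rule, and the two example crawlers with their URL patterns are illustrative choices, not part of the pattern itself. Only Crawler.fetch, Dispatcher.register, and Dispatcher.get_crawler come from the description above.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical shape; the pattern only says fetch(url) returns a document.
    url: str
    source: str
    content: str


class Crawler(ABC):
    """Interface every per-source crawler implements."""

    @abstractmethod
    def fetch(self, url: str) -> Document:
        ...


class GitHubCrawler(Crawler):
    def fetch(self, url: str) -> Document:
        # Placeholder body: a real crawler would handle auth, pagination,
        # and rate limits for this source here.
        return Document(url=url, source="github", content="")


class MediumCrawler(Crawler):
    def fetch(self, url: str) -> Document:
        return Document(url=url, source="medium", content="")


class Dispatcher:
    """Maps URL patterns to crawler classes (assumed: regex, first match wins)."""

    def __init__(self) -> None:
        self._registry: list[tuple[re.Pattern[str], type[Crawler]]] = []

    def register(self, pattern: str, crawler_cls: type[Crawler]) -> None:
        self._registry.append((re.compile(pattern), crawler_cls))

    def get_crawler(self, url: str) -> Crawler:
        for pattern, crawler_cls in self._registry:
            if pattern.match(url):
                return crawler_cls()
        raise LookupError(f"no crawler registered for {url}")


dispatcher = Dispatcher()
# Adding a source is one registration call; the patterns are illustrative.
dispatcher.register(r"https://(www\.)?github\.com/", GitHubCrawler)
dispatcher.register(r"https://([\w-]+\.)?medium\.com/", MediumCrawler)

url = "https://github.com/octocat/hello-world"
doc = dispatcher.get_crawler(url).fetch(url)
```

Raising on an unmatched URL, rather than falling back to a generic crawler, is one possible design choice: it makes missing registrations visible immediately instead of silently producing low-quality documents.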
Real-world Use Case
- Many heterogeneous sources need ingestion and more get added frequently.
- Per-source logic differs enough that sharing code creates more problems than it solves.
- Tests for one crawler should never import or depend on another.
📌 TL;DR
One dispatcher, many crawlers. Route by URL pattern. Add sources by registering a class. Per-source logic stays isolated and testable.
Advantages
- Adding a source is a registration call, not an edit to an existing module, so existing crawlers are not touched.
- Per-source crawlers evolve and are tested in isolation.
- Dispatch logic is one small, reviewable surface.
Disadvantages
- URL pattern matching becomes ambiguous when several sources share the same host, because host-only patterns can no longer tell them apart.
- Cross-source coordination (e.g., shared rate-limit budgets) needs a layer above the dispatcher.
- The registry can drift if registrations are scattered across many files and nothing audits them at startup.
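One way to address the ambiguity and drift risks is a small startup audit that probes the registry with known URLs and flags any URL that matches zero patterns or more than one. The sketch below is an assumption-laden illustration: the REGISTRY snapshot, the crawler names, the patterns, and the probe URLs are all hypothetical.

```python
import re

# Hypothetical registry snapshot: (URL pattern, crawler name) pairs.
REGISTRY = [
    (r"https://github\.com/", "GitHubCrawler"),
    (r"https://github\.com/[^/]+/[^/]+/wiki", "GitHubWikiCrawler"),
    (r"https://([\w-]+\.)?substack\.com/", "SubstackCrawler"),
]

# Sample URLs that should each resolve to exactly one crawler.
PROBE_URLS = [
    "https://github.com/octocat/hello-world",
    "https://github.com/octocat/hello-world/wiki",
    "https://example.substack.com/p/post",
    "https://linkedin.com/in/someone",
]


def audit(registry, probe_urls):
    """Return {url: [matching crawler names]} for every probe URL
    that does not match exactly one registered pattern."""
    problems = {}
    for url in probe_urls:
        matches = [name for pattern, name in registry if re.match(pattern, url)]
        if len(matches) != 1:
            problems[url] = matches
    return problems


problems = audit(REGISTRY, PROBE_URLS)
for url, matches in problems.items():
    label = "ambiguous" if matches else "unregistered"
    print(f"{label}: {url} -> {matches}")
```

With these sample values, the wiki URL is reported as ambiguous (two patterns on the same host match it) and the LinkedIn URL as unregistered. Running such a check once at startup, or in CI, catches both problems before any crawl begins.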