Domain-Adaptive Tokenization
Extend or retrain the tokenizer on your domain text before fine-tuning — fewer tokens per concept means more content fits in the context window.
Intent & Description
🎯 Intent
A general tokenizer fragments domain-specific terms into many subword pieces — wasting context window tokens and degrading model performance on domain tasks.
📋 Context
GPT-4’s tokenizer fragments medical terms like “hypertriglyceridemia” into 7+ tokens and Python identifiers into multiple pieces. Every fragmented term means fewer real concepts fit in context, and the model sees arbitrary splits that don’t reflect domain structure.
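To make the cost of fragmentation concrete, the sketch below counts tokens per word with a toy greedy longest-match tokenizer. The two vocabularies are hypothetical and chosen only for illustration. They are not GPT-4's actual vocabulary, and real BPE tokenizers segment text differently.

```python
def greedy_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right.

    Falls back to a single character when no longer piece matches.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def tokens_per_word(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    n_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)


# Hypothetical general-purpose vocabulary: knows common words and short pieces.
GENERAL_VOCAB = {"the", "patient", "has",
                 "hyper", "tri", "gly", "cer", "id", "em", "ia"}
# Hypothetical domain vocabulary: the same, plus the full medical term.
DOMAIN_VOCAB = GENERAL_VOCAB | {"hypertriglyceridemia"}

sentence = "the patient has hypertriglyceridemia"
general_ratio = tokens_per_word(sentence, GENERAL_VOCAB)
domain_ratio = tokens_per_word(sentence, DOMAIN_VOCAB)
print(f"general: {general_ratio:.2f} tokens/word, domain: {domain_ratio:.2f}")
```

Running the same ratio over a representative sample of real domain text, with the real tokenizer in place of `greedy_tokenize`, gives a baseline to compare against after any vocabulary change.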
💡 Solution
- Collect a domain corpus (medical literature, code repositories, legal documents).
- Train a BPE or Unigram tokenizer on the domain text to surface high-frequency domain tokens.
- Merge the new domain tokens into the base vocabulary (vocabulary expansion).
- Fine-tune the model’s embedding rows for the new tokens (and the matching output-projection rows, if the model does not tie them) while keeping the base weights frozen.
- Measure the token-per-word ratio on representative domain text before and after to quantify the gain.
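The training and vocabulary-expansion steps can be sketched in plain Python. This is a minimal character-level BPE trainer on a toy corpus, assuming whitespace-split words and no byte-level handling. The names `learn_bpe_merges`, `apply_merge`, `encode`, and `expand_vocab` are illustrative, not a library API, and a production setup would normally rely on an established tokenizer library.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(corpus_words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    word_freqs = Counter(tuple(w) for w in corpus_words)  # words as character tuples
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:  # every word is already a single symbol
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged_freqs[apply_merge(symbols, best)] += freq
        word_freqs = merged_freqs
    return merges


def encode(word, merges):
    """Tokenize `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


def expand_vocab(base_vocab, merges):
    """Append merged tokens missing from `base_vocab`, keeping existing ids unchanged."""
    vocab = dict(base_vocab)
    for a, b in merges:
        vocab.setdefault(a + b, len(vocab))
    return vocab


# Toy domain corpus (hypothetical); a real corpus would be far larger.
corpus = ["statin", "statin", "statins", "lipid", "lipid", "lipid"]
merges = learn_bpe_merges(corpus, num_merges=50)

# Stand-in base vocabulary: single characters only, ids 0..n-1.
base_vocab = {ch: i for i, ch in enumerate(sorted(set("".join(corpus))))}
vocab = expand_vocab(base_vocab, merges)

print(len(encode("lipid", [])), "tokens before,", len(encode("lipid", merges)), "after")
```

Because `expand_vocab` only appends, every base token keeps its original id, which is what allows the base embedding rows to stay frozen while the new rows are trained.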
Real-world Use Case
📌 TL;DR
When the tokenizer turns your domain’s vocabulary into noise, extend it before fine-tuning. Fewer tokens per concept means more usable context, shorter training sequences, and better domain performance.
Advantages
- Reduces sequence length for domain text — more content fits in the context window
- Model sees linguistically meaningful token boundaries, not arbitrary subword splits
- Improves downstream task performance on domain-specific benchmarks
Disadvantages
- Vocabulary expansion requires re-training or fine-tuning the embedding layer — not free
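- New tokens start with untrained (typically randomly initialized) embeddings that need extra training steps to converge
- Larger vocabulary grows the embedding matrix and the output softmax, adding memory and per-step compute that partly offset the gain from shorter sequences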
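The embedding costs listed above can be illustrated with a small sketch. It assumes a commonly used alternative to random initialization, which this pattern does not itself prescribe: each new row starts at the mean of the base rows for the pieces the token used to split into. The helper names and the plain-list embedding table are hypothetical stand-ins for a real framework's tensors.

```python
def extend_embeddings(table, new_token_pieces):
    """Return a copy of `table` with one appended row per new token.

    `new_token_pieces` holds, for each new token, the base-vocabulary ids it
    previously split into. Each new row is the mean of those base rows.
    """
    dim = len(table[0])
    extended = [row[:] for row in table]
    for piece_ids in new_token_pieces:
        rows = [table[i] for i in piece_ids]
        extended.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return extended


def sgd_step_new_rows_only(table, grads, num_base, lr):
    """Apply a plain SGD update to rows >= `num_base`; base rows stay frozen."""
    for i in range(num_base, len(table)):
        table[i] = [w - lr * g for w, g in zip(table[i], grads[i])]


def added_parameters(num_new_tokens, hidden_dim, tied_output=True):
    """Extra weights from new rows; doubled when input and output embeddings are untied."""
    per_matrix = num_new_tokens * hidden_dim
    return per_matrix if tied_output else 2 * per_matrix


# Toy base table: 3 tokens, 2 dimensions (values are arbitrary).
base_table = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
# One new token that used to split into base tokens 0 and 1.
table = extend_embeddings(base_table, [[0, 1]])

# Pretend every row received the same gradient; only the new row should move.
grads = [[1.0, 1.0] for _ in table]
sgd_step_new_rows_only(table, grads, num_base=len(base_table), lr=0.5)

print(added_parameters(1000, 4096), "extra weights for 1000 new tokens at width 4096")
```

Starting new rows near existing embeddings is intended to shorten the convergence period, and the parameter count shows why the number of added tokens is worth keeping modest.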