Byte-Pair Encoding (BPE) Tokenization
Build a vocabulary of subword units from corpus pair frequencies: common words get single tokens, rare words decompose into known pieces, and byte-level variants never produce an unknown token.
Intent & Description
🎯 Intent
Handle any text — including rare and unseen words — without ever hitting an unknown token, by decomposing unfamiliar text into learned subword pieces.
📋 Context
Word-level tokenization produces massive vocabularies and fails on rare words. Character-level handles everything but explodes sequence length. BPE finds the middle ground — common words become single tokens, rare ones decompose into subword pieces the model already knows.
💡 Solution
Start with a character-level vocabulary. Iteratively merge the most frequent adjacent token pair into a new compound token, add it to the vocabulary, and record the merge rule. Repeat until the target vocabulary size is reached (typically 32K–100K). Byte-level BPE, as used by the GPT-2 and GPT-4 tokenizers, starts from the 256 possible byte values instead of characters, so any Unicode input can be encoded without an unknown token.
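The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: it assumes the corpus is already split into words, stops after a fixed number of merges rather than at a target vocabulary size, and breaks frequency ties arbitrarily. The names `merge_pair`, `train_bpe`, and `encode` are hypothetical helpers chosen for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from characters: each distinct word is a tuple of symbols with a count.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair; ties go to the lexicographically larger pair
        # (an arbitrary choice made here only for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: word frequencies are expressed by repetition.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

On this toy corpus the four learned merges let `encode("lowest", merges)` return `['low', 'est']`, even though "lowest" never appears in the training data. The resulting vocabulary size is the number of base characters plus the number of merges.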
Real-world Use Case
📌 TL;DR
BPE builds a subword vocabulary by repeatedly merging the most frequent adjacent pairs, so unfamiliar text falls back to smaller known pieces. A smaller vocabulary means longer sequences: measure fragmentation on your domain before committing.
Advantages
- Decomposes rare or unseen words into known subword pieces instead of mapping them to an unknown token
- Vocabulary size is a tunable parameter — balance sequence length against embedding table size
- Byte-level BPE eliminates OOV entirely — any Unicode input is always encodable
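The byte-level guarantee follows from the base vocabulary alone: every valid Unicode string has a UTF-8 encoding, and every UTF-8 byte is one of 256 values. The sketch below shows only that fallback layer, before any merges are applied. The helper names `to_byte_tokens` and `from_byte_tokens` are hypothetical, and real byte-level tokenizers add learned merges on top of this.

```python
def to_byte_tokens(text):
    """Map text to base token IDs in the range 0..255 via its UTF-8 bytes."""
    return list(text.encode("utf-8"))


def from_byte_tokens(ids):
    """Invert to_byte_tokens: rebuild the string from base token IDs."""
    return bytes(ids).decode("utf-8")


sample = "naïve 東京 🚀"
ids = to_byte_tokens(sample)

# Every ID fits the 256-entry base vocabulary, and the round trip is lossless.
print(all(0 <= i < 256 for i in ids))
print(from_byte_tokens(ids) == sample)
```

The cost is length: non-ASCII characters take two to four base tokens each until merges combine them, which is one source of the fragmentation noted under Disadvantages.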
Disadvantages
- Domain-specific terms may fragment into many tokens — inflating sequence length and wasting context
- Tokenization is model-specific: feeding a model token IDs produced by a different tokenizer raises no error but silently corrupts the input
- Vocabulary size trades off sequence efficiency against embedding table memory
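One rough way to check fragmentation on your own domain is to compare the average number of tokens per whitespace-separated word across candidate tokenizers. The sketch below assumes `tokenize` is any callable that maps a string to a list of tokens; `tokens_per_word` is a hypothetical helper, and the two toy tokenizers stand in for real ones.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    # Guard against an empty sample rather than dividing by zero.
    return total_tokens / total_words if total_words else 0.0


# Stand-in tokenizers for illustration only.
def word_tokenizer(text):
    return text.split()


def char_tokenizer(text):
    return [ch for ch in text if not ch.isspace()]


domain_sample = ["acetylsalicylic acid inhibits cyclooxygenase"]
print(tokens_per_word(domain_sample, word_tokenizer))
print(tokens_per_word(domain_sample, char_tokenizer))
```

A word-level tokenizer scores exactly 1.0 on this metric and a character-level one scores the average word length, so a BPE tokenizer should land between the two. A value that rises sharply on domain text compared with general text suggests the vocabulary fits that domain poorly.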