Mixture of Experts (MoE)
Replace the dense FFN with N parallel expert networks routed per-token — massive total parameter counts with constant active parameters per token. Sparse compute, dense capability.
Intent & Description
🎯 Intent
Scale total model parameters beyond what a dense model's per-token compute budget would allow: a router activates only a small fraction of experts per token, giving large-model quality at smaller-model compute cost per token.
📋 Context
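Dense LLMs activate all parameters for every token, so doubling parameters roughly doubles per-token compute. MoE decouples total parameters from active parameters: a model can have 400B total parameters while only activating 40B per token. Mixtral 8x7B has 46.7B total parameters but only 12.9B active per token, and it matches or exceeds Llama 2 70B on most reported benchmarks at a fraction of the inference compute. The total is below 8 × 7B = 56B because only the FFN blocks are replicated per expert; attention and embedding weights are shared.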
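The 46.7B total and 12.9B active figures can be reproduced with simple parameter accounting. The sketch below assumes the published Mixtral 8x7B hyperparameters (hidden size 4096, FFN size 14336, 32 layers, grouped-query attention with 8 KV heads, vocabulary 32000, untied output head); none of these values appear in the text above, and the helper `count_params` is a hypothetical name used only for illustration.

```python
# Parameter accounting for a Mixtral-8x7B-shaped MoE transformer.
# All hyperparameters are assumptions taken from the published
# Mixtral 8x7B configuration, not from the text above.
D_MODEL = 4096
D_FFN = 14336
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2


def count_params(experts_counted: int) -> int:
    """Count parameters when `experts_counted` experts per layer are included."""
    # SwiGLU expert: gate, up, and down projections (no biases).
    expert = 3 * D_MODEL * D_FFN
    # Grouped-query attention: full-size Q and O, smaller K and V.
    attn = 2 * D_MODEL * (N_HEADS * HEAD_DIM) + 2 * D_MODEL * (N_KV_HEADS * HEAD_DIM)
    router = D_MODEL * N_EXPERTS
    norms = 2 * D_MODEL  # two RMSNorm weight vectors per layer
    per_layer = experts_counted * expert + attn + router + norms
    # Input embedding, untied output head, final norm.
    outside_layers = 2 * VOCAB * D_MODEL + D_MODEL
    return N_LAYERS * per_layer + outside_layers


total = count_params(N_EXPERTS)   # every expert must be stored
active = count_params(TOP_K)      # only TOP_K experts run per token
print(f"total:  {total / 1e9:.1f}B")         # total:  46.7B
print(f"active: {active / 1e9:.1f}B")        # active: 12.9B
print(f"active / total: {active / total:.2f}")  # active / total: 0.28
```

Almost all of the gap comes from the expert FFNs: attention, embeddings, and the router are counted once in both totals, which is why the active count is well above 46.7B / 4.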
💡 Solution
Replace each dense FFN layer with N expert FFN networks (N=8 in Mixtral and Grok-1; DeepSeek-V3 uses 256 routed experts). Add a learned router (a small linear layer) that takes the token representation and outputs softmax scores over the experts. Select the top-K experts per token (K=2 in Mixtral and Grok-1; K=8 in DeepSeek-V3). Compute the layer output for the token as a weighted sum of the top-K expert outputs, using the router scores as weights. Training typically adds a load-balancing auxiliary loss to prevent router collapse (all traffic going to one expert). Modern implementations: Mixtral 8x7B, Mixtral 8x22B, Grok-1, DeepSeek-V3.
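The routing steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any particular model's implementation: the experts are hypothetical two-layer ReLU FFNs, the names `moe_forward` and `make_expert` are invented for the example, and the softmax is taken over the selected top-K logits, which is equivalent to renormalizing a full softmax over just the chosen experts.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_forward(x, w_router, experts, top_k=2):
    """Sparse MoE layer: x is (tokens, d_model), w_router is (d_model, n_experts)."""
    logits = x @ w_router                                # (tokens, n_experts)
    # Indices of the top_k highest-scoring experts for each token.
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # (tokens, top_k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected experts only, so each token's gates sum to 1.
    gates = softmax(top_logits, axis=-1)                 # (tokens, top_k)

    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        # Tokens that picked expert e, and which of their top_k slots it fills.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue                                     # expert idle for this batch
        weight = gates[token_ids, slot][:, None]         # (n_routed, 1)
        out[token_ids] += weight * expert(x[token_ids])
    return out, top_idx, gates


def make_expert(rng, d_model, d_ffn):
    """Hypothetical two-layer ReLU FFN standing in for a real expert."""
    w1 = rng.normal(0, 0.02, (d_model, d_ffn))
    w2 = rng.normal(0, 0.02, (d_ffn, d_model))
    return lambda h: np.maximum(h @ w1, 0.0) @ w2


rng = np.random.default_rng(0)
d_model, d_ffn, n_experts, n_tokens = 16, 64, 8, 5
experts = [make_expert(rng, d_model, d_ffn) for _ in range(n_experts)]
w_router = rng.normal(0, 0.02, (d_model, n_experts))
x = rng.normal(size=(n_tokens, d_model))

y, top_idx, gates = moe_forward(x, w_router, experts, top_k=2)
print(y.shape)   # (5, 16)
print(top_idx)   # two expert indices per token
```

Each expert only sees the tokens routed to it, so per-token compute depends on K rather than N. Production systems replace the Python loop with batched gather/scatter kernels, but the arithmetic is the same.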
Real-world Use Case
📌 TL;DR
N FFN experts, top-K active per token. Massive total params at constant active compute — big-model quality, small-model running cost. The architecture behind Mixtral, Grok-1, and DeepSeek-V3.
Advantages
- Large-model quality at small-model per-token compute cost: Mixtral 8x7B matches or outperforms Llama 2 70B on most reported benchmarks with roughly 1/5 of the active parameters per token (12.9B vs 70B)
- Total parameters scale with the number of experts, so model capacity grows without a proportional increase in per-token compute
- Composable with GQA and Flash Attention on the attention side without structural changes
Disadvantages
- Memory scales with total, not active, parameters: serving a 400B MoE requires holding all 400B parameters in VRAM/RAM (about 800 GB at 16-bit precision), even though only a fraction is used per token
- Load balancing is non-trivial — poor router training causes expert collapse and quality degradation
- Communication overhead in distributed serving — expert parallelism requires all-to-all communication between GPUs
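The load-balancing concern in the list above can be made concrete with an auxiliary loss. The sketch below assumes the Switch Transformer-style formulation, N · Σ f_i · P_i, where f_i is the share of routing slots assigned to expert i and P_i is the mean router probability for expert i; the text does not say which formulation a given model uses, and `load_balancing_loss` is a hypothetical helper name. In a real training setup the gradient flows through P_i only, and the result is scaled by a small coefficient before being added to the language-modeling loss.

```python
import numpy as np


def load_balancing_loss(router_logits, top_k=2):
    """Auxiliary loss N * sum_i f_i * P_i over a batch of router logits."""
    n_tokens, n_experts = router_logits.shape
    # P_i: mean softmax probability the router assigns to expert i.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    mean_prob = probs.mean(axis=0)
    # f_i: share of the top_k routing slots that went to expert i.
    top_idx = np.argsort(router_logits, axis=-1)[:, -top_k:]
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    frac = counts / (n_tokens * top_k)
    return n_experts * float(np.sum(frac * mean_prob))


rng = np.random.default_rng(0)
balanced = rng.normal(0, 0.01, size=(4096, 8))   # near-uniform router
collapsed = balanced.copy()
collapsed[:, 0] += 10.0                          # expert 0 wins every token

print(load_balancing_loss(balanced))    # close to 1.0
print(load_balancing_loss(collapsed))   # well above 1.0
```

With perfectly uniform routing the loss is 1.0, its minimum; it grows as traffic and router probability concentrate on a few experts, which is the signal that pushes the router away from collapse.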