Mixture of Experts (MoE) | designpattern.fyi

Advantages

Large-model quality at small-model per-token compute cost: Mixtral 8x7B matches or outperforms Llama 2 70B on most benchmarks while activating about 13B of its 47B parameters per token, roughly 1/5 the inference FLOPs
Total parameters scale with the number of experts, while per-token compute depends only on the experts activated per token: model capacity grows without a proportional increase in compute
Composable with GQA and Flash Attention on the attention side without structural changes
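
The gap between total and active parameters can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the dimensions are assumed values for a Mixtral-style configuration, each expert is assumed to be a gated feed-forward block with three weight matrices, and attention and embedding parameters are ignored.

```python
def expert_ffn_params(d_model, d_ff, n_matrices=3):
    """Parameters in one expert FFN.

    n_matrices=3 assumes a gated (SwiGLU-style) FFN; use 2 for a plain MLP.
    """
    return n_matrices * d_model * d_ff


def moe_param_counts(d_model, d_ff, n_layers, n_experts, top_k):
    """Return (total, active_per_token) expert parameters across all layers."""
    per_expert = expert_ffn_params(d_model, d_ff)
    total = n_layers * n_experts * per_expert   # must all be resident in memory
    active = n_layers * top_k * per_expert      # actually used for one token
    return total, active


# Assumed, illustrative dimensions (not taken from the text above).
total, active = moe_param_counts(
    d_model=4096, d_ff=14336, n_layers=32, n_experts=8, top_k=2
)
print(f"total expert params:  {total / 1e9:.1f}B")
print(f"active per token:     {active / 1e9:.1f}B")
print(f"active / total ratio: {active / total:.2f}")
```

Under these assumptions the expert weights come to about 45B in total, of which about 11B are used for any single token. The ratio is simply top_k / n_experts, which is why adding experts raises capacity without raising per-token compute.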

Disadvantages

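Total memory scales with all expert parameters: serving a 400B-parameter MoE requires holding all 400B parameters in VRAM/RAM (about 800 GB at 16-bit precision), even though only a fraction is used per token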
Load balancing is non-trivial: a poorly trained router sends most tokens to a few experts (expert collapse), leaving the others undertrained and degrading quality
Communication overhead in distributed serving: expert parallelism requires all-to-all communication between GPUs to dispatch tokens to their experts and gather the results
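
To illustrate the load-balancing point, here is a minimal pure-Python sketch of top-k routing with an auxiliary balance loss in the style of the Switch Transformer (number of experts times the sum over experts of dispatch fraction times mean router probability). The function names are hypothetical, and normalizing the dispatch fraction by top-k is an assumption made here so the loss equals 1.0 under perfect balance; real implementations operate on tensors and scale this term by a small coefficient.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the router probabilities renormalized over the chosen experts.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def load_balance_loss(batch_logits, k):
    """Auxiliary loss: n_experts * sum_i(dispatch_fraction_i * mean_prob_i).

    Equals 1.0 when routing is perfectly uniform and grows toward n_experts
    as tokens concentrate on a single expert.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    mean_probs = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean_probs[i] += p / n_tokens
        for i, _ in top_k_route(logits, k):
            counts[i] += 1
    # Assumption: divide by k so the dispatch fractions sum to 1 for any top-k.
    fractions = [c / (n_tokens * k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(fractions, mean_probs))


# Each token prefers a different expert: balanced routing.
balanced = [[5.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Every token prefers expert 0: collapsed routing.
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(load_balance_loss(balanced, k=1))
print(load_balance_loss(collapsed, k=1))
```

The collapsed batch yields a much larger loss than the balanced one, so adding this term to the training objective pushes the router away from sending every token to the same expert.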