Model Merging (Task Vectors / TIES / DARE / SLERP)
Combine specialized fine-tuned models in weight space — average their task deltas, resolve sign conflicts, and drop noise — to get multi-capability models with zero training.
Intent & Description
🎯 Intent
Combine the capabilities of multiple specialized fine-tuned models into a single model without any additional training — by operating directly in weight space on the delta from the shared base.
📋 Context
You have a code-specialized model and a math-specialized model, both fine-tuned from the same base. Training a combined model from scratch is expensive. Applying arithmetic to the weight deltas can often produce a single model with both capabilities at near-additive quality, because the delta updates for roughly orthogonal capabilities interfere little with each other.
💡 Solution
Compute task vectors: τ = θ_finetuned - θ_base for each specialized model. Apply one of:
- Task Arithmetic: θ_merged = θ_base + Σ(λᵢ × τᵢ), a weighted sum of task vectors.
- TIES-Merging: Trim small delta values (noise), elect sign consensus across models, merge only parameters that agree in direction. Handles parameter interference better than simple averaging.
- DARE: Randomly drop (sparsify) delta parameters before merging, then rescale — reduces redundancy and interference when merging many models.
- SLERP: Spherical linear interpolation between two model checkpoints — preserves geometric structure better than linear averaging for two-model merges.
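The four methods above can be sketched on toy NumPy arrays standing in for flattened weights. This is a minimal illustration under stated assumptions, not mergekit's implementation: the function names (`task_arithmetic`, `ties_merge`, `dare`, `slerp`), the default `density` and `drop_rate` values, and the treatment of each model as one array are all chosen here for illustration. A real merge would apply the same arithmetic tensor by tensor across a checkpoint.

```python
import numpy as np

def task_arithmetic(base, taus, lambdas):
    """theta_merged = theta_base + sum_i(lambda_i * tau_i)."""
    return base + sum(lam * tau for lam, tau in zip(lambdas, taus))

def ties_merge(base, taus, density=0.2, lam=1.0):
    """Trim small deltas, elect one sign per parameter, average agreeing deltas."""
    trimmed = []
    for tau in taus:
        k = max(1, int(round(density * tau.size)))
        # Keep the k largest-magnitude entries (ties at the cutoff are kept too).
        threshold = np.sort(np.abs(tau).ravel())[-k]
        trimmed.append(np.where(np.abs(tau) >= threshold, tau, 0.0))
    stacked = np.stack(trimmed)
    # Elected sign = direction carrying the larger total magnitude.
    elected = np.sign(stacked.sum(axis=0))
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)  # avoid division by zero
    merged_tau = (stacked * agree).sum(axis=0) / counts
    return base + lam * merged_tau

def dare(tau, drop_rate=0.9, rng=None):
    """Randomly zero a fraction of the delta, then rescale the survivors."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(tau.shape) >= drop_rate
    return tau * keep / (1.0 - drop_rate)

def slerp(theta_a, theta_b, t):
    """Spherical interpolation between two nonzero weight arrays, t in [0, 1]."""
    a, b = theta_a.ravel(), theta_b.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * theta_a + t * theta_b
    wa = np.sin((1 - t) * omega) / np.sin(omega)
    wb = np.sin(t * omega) / np.sin(omega)
    return wa * theta_a + wb * theta_b

# Toy example: hypothetical 4-parameter "models" sharing a zero base.
base = np.zeros(4)
theta_code = np.array([0.5, 0.0, -0.2, 0.01])
theta_math = np.array([0.0, 0.4, 0.3, -0.02])
taus = [theta_code - base, theta_math - base]  # tau = theta_finetuned - theta_base

summed = task_arithmetic(base, taus, [1.0, 1.0])  # approx [0.5, 0.4, 0.1, -0.01]
ties = ties_merge(base, taus, density=0.5)        # approx [0.5, 0.4, 0.3, 0.0]
```

In the toy example the third parameter shows the difference: plain summation lets the conflicting deltas (-0.2 and 0.3) partly cancel, while the TIES sketch keeps only the delta that matches the elected sign. DARE is typically applied to each task vector first, and the sparsified vectors are then combined with task arithmetic or TIES.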
Tooling:
mergekit (open source; supports TIES, DARE, SLERP, and Task Arithmetic).
📌 TL;DR
Subtract the base, add the deltas. Combine code + math + instruction models into one in minutes with no training. Use TIES or DARE when simple averaging degrades quality.
Advantages
- Zero training required — combine models in minutes on CPU with mergekit
- Produces standalone deployable checkpoints — no runtime adapter loading overhead
- Can recover capabilities that were degraded by fine-tuning on one task (e.g., restoring general reasoning after code fine-tuning)
Disadvantages
- Parameter interference degrades quality when merged capabilities are not orthogonal
- Merging coefficients (λᵢ, density, weights) require empirical tuning — no analytical solution
- Performance ceiling is bounded by the quality of the individual fine-tuned models
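The first two disadvantages can be probed cheaply before committing to a merge. The sketch below is one possible approach, not an established recipe: it measures how far two task vectors are from orthogonal (cosine similarity and the fraction of shared parameters with opposing signs) and runs a brute-force sweep over the λᵢ coefficients. All function names are hypothetical, and `score_fn` is an assumed user-supplied callable that evaluates a merged weight array and returns a number where higher is better (for example, accuracy on a held-out set).

```python
import itertools
import numpy as np

def task_vector_cosine(tau_a, tau_b):
    """Cosine similarity of two nonzero task vectors; near 0 means near-orthogonal."""
    a, b = tau_a.ravel(), tau_b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sign_conflict_rate(tau_a, tau_b):
    """Fraction of parameters changed by both models that moved in opposite directions."""
    both = (tau_a != 0) & (tau_b != 0)
    if not both.any():
        return 0.0
    return float((np.sign(tau_a[both]) != np.sign(tau_b[both])).mean())

def tune_lambdas(base, taus, score_fn, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Exhaustive sweep over per-model coefficients; returns (best_lambdas, best_score)."""
    best_score, best_lambdas = -np.inf, None
    for lambdas in itertools.product(grid, repeat=len(taus)):
        merged = base + sum(lam * tau for lam, tau in zip(lambdas, taus))
        score = score_fn(merged)
        if score > best_score:
            best_score, best_lambdas = score, lambdas
    return best_lambdas, best_score
```

The sweep costs one evaluation per grid point, so it grows exponentially with the number of merged models. That is workable for two or three models and a coarse grid; beyond that, a random or coordinate-wise search is likely the more practical choice.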