Attention-Manipulation Explainability
Identify which input tokens actually drove a model output by perturbing attention weights and measuring probability shifts — no self-reported confabulation.
Intent & Description
🎯 Intent
Surface which input tokens caused a given output by perturbing attention across all transformer layers and measuring the resulting change in output probability — producing a per-token relevance map alongside the model’s response.
📋 Context
In regulated settings such as lending, healthcare, and legal decisions, stakeholders need evidence about what drove an output, not a generated paragraph of self-justification. LLMs can confabulate their reasons; a perturbation measurement reports how the output actually changed when an input was suppressed.
💡 Solution
Run a structured perturbation pass: for each input token (or chunk), suppress its attention contribution and measure the change in output token probabilities. Tokens whose suppression most reduces output probability are the most relevant. Surface this as a heat-map alongside the answer. Keep attribution on the inference side — never ask the model to self-explain in prose.
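The perturbation pass described above can be sketched on a toy model. This is a minimal illustration, not a production implementation: `ToyAttentionModel`, its random weights, and `token_relevance` are hypothetical names invented here, and the toy uses bidirectional attention with no causal mask. Suppression is modelled as a large negative bias on the attention scores of the chosen key positions in every layer, and relevance as the drop in the target token's log-probability.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class ToyAttentionModel:
    """Hypothetical stand-in for a transformer: a few attention layers plus an output head."""

    def __init__(self, vocab=50, dim=16, layers=2, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.emb = rng.normal(size=(vocab, dim))
        self.wq = rng.normal(size=(layers, dim, dim)) * scale
        self.wk = rng.normal(size=(layers, dim, dim)) * scale
        self.wv = rng.normal(size=(layers, dim, dim)) * scale
        self.out = rng.normal(size=(dim, vocab)) * scale

    def next_token_logprobs(self, tokens, suppressed=()):
        """Log-probabilities of the next token; `suppressed` lists input positions to hide."""
        x = self.emb[np.asarray(tokens)]
        n, dim = x.shape
        bias = np.zeros((n, n))
        for j in suppressed:
            bias[:, j] = -1e9  # no query position can attend to key position j
        for wq, wk, wv in zip(self.wq, self.wk, self.wv):
            scores = (x @ wq) @ (x @ wk).T / np.sqrt(dim) + bias  # same bias in every layer
            x = x + softmax(scores) @ (x @ wv)                     # residual connection
        logits = x[-1] @ self.out
        m = logits.max()
        return logits - (m + np.log(np.exp(logits - m).sum()))    # log-softmax

def token_relevance(model, tokens, target):
    """Relevance of position i = baseline log-prob of `target` minus log-prob with i suppressed."""
    base = model.next_token_logprobs(tokens)[target]
    drops = [base - model.next_token_logprobs(tokens, suppressed=[i])[target]
             for i in range(len(tokens))]
    return np.array(drops)

model = ToyAttentionModel()
tokens = [3, 17, 8, 42]
target = int(np.argmax(model.next_token_logprobs(tokens)))  # explain the model's top prediction
relevance = token_relevance(model, tokens, target)
for tok, r in zip(tokens, relevance):
    print(f"token {tok:>2}: relevance {r:+.3f}")
```

A positive score means suppressing that position lowered the probability of the output, so the position supported it; the array of scores is what a heat-map would render. With a real open-weights model the same idea would be applied to its own attention layers rather than to this toy.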
Real-world Use Case
- You need a faithful per-token relevance map of which inputs actually caused a given output.
- You control inference (open weights or a provider exposing attention perturbation).
- Free-text self-explanations are insufficient because the model confabulates its reasoning.
📌 TL;DR
Stop asking the model to explain itself — perturb its attention weights instead and get a heat-map of what actually drove the output.
Advantages
- Faithful (mechanistic) attribution — not a post-hoc story the model made up.
- Supports audit and right-to-explanation requirements by providing measured evidence instead of generated justification.
- User-visible heat-maps build calibrated trust rather than blind faith.
Disadvantages
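- Requires white-box access to attention weights, or a provider that exposes attention perturbation; hosted black-box APIs offer neither.
- Compute overhead per request: one extra forward pass per token or chunk on top of the baseline pass, so cost grows linearly with input length.
- Token-level attribution can mislead when the output depends on several tokens jointly; suppressing any one of them alone may barely change the output probability.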
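The last two disadvantages can be made concrete with a small sketch. Everything here is hypothetical: `forward_passes` simply counts a baseline pass plus one perturbed pass per chunk, and `toy_logprob` is an invented stand-in for a model whose answer is supported by two redundant tokens (positions 1 and 2), so that either one alone is enough.

```python
import math
from itertools import combinations

def forward_passes(n_tokens, chunk_size=1):
    """One baseline pass plus one perturbed pass per chunk of `chunk_size` tokens."""
    return 1 + math.ceil(n_tokens / chunk_size)

def toy_logprob(suppressed):
    """Hypothetical model: the answer stays likely while position 1 OR position 2 is visible."""
    evidence_visible = not ({1, 2} <= set(suppressed))
    return math.log(0.9) if evidence_visible else math.log(0.1)

def relevance(groups):
    """Drop in log-probability when each group of positions is suppressed together."""
    base = toy_logprob(())
    return {g: base - toy_logprob(g) for g in groups}

single = relevance([(i,) for i in range(4)])           # one position at a time
pairs = relevance(list(combinations(range(4), 2)))     # two positions at a time

print(forward_passes(512), forward_passes(512, chunk_size=16))
print(single)
print(pairs[(1, 2)])
```

In this toy, every single-position score is zero even though positions 1 and 2 carry all the evidence; only suppressing them together reveals the dependence. Chunked suppression reduces the number of passes and can catch such joint effects when the collaborating tokens fall in the same chunk, at the price of a coarser map.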