Good folks from Universitat Politècnica de Catalunya, University of Groningen, and Meta have released "A Primer on the Inner Workings of Transformer-based Language Models."
They don't make survey papers like they used to, but this is an exciting new survey on Transformer LM interpretability!
This comprehensive survey provides a technical deep dive into:
• Transformer architecture components (attention, FFN, residual stream)
• Methods for localizing model behavior:
- Input attribution (gradient & perturbation-based; see the sketch after this list)
- Component importance (logit attribution, causal interventions)
• Information decoding techniques:
- Probing, linear feature analysis
- Sparse autoencoders for disentangling features
• Key insights on model internals:
- Attention mechanisms (induction heads, copy suppression)
- FFN neuron behaviors
- Residual stream properties
- Multi-component emergent behaviors
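To give a flavor of the attribution side, here is a minimal sketch of gradient-based input attribution (gradient x embedding) on GPT-2. This is not code from the paper; the checkpoint, prompt, and variable names are illustrative assumptions of mine.

```python
# Minimal sketch: gradient x embedding input attribution on GPT-2.
# Illustrative only; the prompt and target token are arbitrary choices.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "When Mary and John went to the store, John gave a drink to"
ids = tok(text, return_tensors="pt").input_ids

# Embed the tokens manually so we can take gradients w.r.t. the embeddings.
embeds = model.transformer.wte(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits

# Score the logit of a candidate next token at the final position.
target_id = tok(" Mary").input_ids[0]
score = logits[0, -1, target_id]
score.backward()

# Gradient x input gives a per-token contribution estimate.
attributions = (embeds.grad * embeds).sum(dim=-1)[0]
for t, a in zip(tok.convert_ids_to_tokens(ids[0].tolist()), attributions.tolist()):
    print(f"{t:>12s}  {a:+.3f}")
```

Perturbation-based variants instead occlude or noise input tokens and measure how the same score changes.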
The paper offers a unified notation and connects insights across different areas of interpretability research. It's a must-read for anyone working on understanding large language models!
Some fascinating technical highlights:
- Detailed breakdowns of attention head circuits (e.g., the indirect object identification (IOI) task)
- Analysis of factual recall mechanisms (see the logit-lens-style sketch after this list)
- Overview of polysemanticity and superposition
- Discussion of grokking as circuit emergence
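And for the decoding-in-vocabulary-space theme, here is a minimal logit-lens-style sketch: projecting GPT-2's residual stream after each layer through the final LayerNorm and unembedding to watch at which layer a factual completion emerges. Again, the model and prompt are illustrative assumptions, not the paper's setup.

```python
# Minimal logit-lens-style sketch on GPT-2. Illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok("The Eiffel Tower is located in the city of", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states[i] is the residual stream after layer i (index 0 = embeddings).
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    top = tok.decode(logits.argmax().item())
    print(f"layer {layer:2d} -> {top!r}")
```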
What interpretability insights do you find most intriguing?