A collection of papers I found useful for learning how to use sparse autoencoders to find interpretable features in language models.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
  Paper • 2309.08600
- Scaling and evaluating sparse autoencoders
  Paper • 2406.04093
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
  Paper • 2403.19647
- Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
  Paper • 2408.05147