arxiv:2410.20526

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Published on Oct 27, 2024

Abstract

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at https://github.com/OpenMOSS/Language-Model-SAEs. These contributions aim to advance the open-source Sparse Autoencoder ecosystem and support mechanistic interpretability research by reducing the need for redundant SAE training.
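
For readers unfamiliar with the Top-K SAE variant evaluated in the abstract, the sketch below illustrates the core encode, sparsify, and decode steps. It is a minimal, hypothetical reconstruction in PyTorch, not the Llama Scope implementation: the hidden size (4096 for Llama-3.1-8B), the 32K feature width, and the value of k are assumptions, and the actual training code lives in the linked OpenMOSS repository.

```python
# Minimal sketch of a Top-K sparse autoencoder forward pass (PyTorch).
# Illustrative only; dimensions, k, and initialization are assumptions,
# not the Llama Scope configuration.
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int = 4096, d_sae: int = 32768, k: int = 50):
        super().__init__()
        self.k = k  # number of latents kept active per token (assumed value)
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Encode: affine map of the (centered) activation, then ReLU.
        pre_acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Top-K sparsity: keep only the k largest activations per token.
        topk_vals, topk_idx = pre_acts.topk(self.k, dim=-1)
        acts = torch.zeros_like(pre_acts).scatter_(-1, topk_idx, topk_vals)
        # Decode: reconstruct the original model activation.
        return acts @ self.W_dec + self.b_dec


# Usage: reconstruct a small batch of residual-stream activations.
sae = TopKSAE()
x = torch.randn(8, 4096)  # 8 token activations at Llama-3.1-8B hidden size
x_hat = sae(x)
print(torch.nn.functional.mse_loss(x_hat, x))
```

The Top-K constraint replaces an L1 sparsity penalty with a hard cap on the number of active latents, which is what makes the sparsity level directly controllable during training.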
