metadata
license: apache-2.0
language:
  - en
base_model:
  - meta-llama/Llama-3.1-8B

Llama Scope

Technical Report Link

Use with the OpenMOSS lm_sae GitHub repo

[Use with SAELens]

[Explore in Neuronpedia]

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 improved TopK SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.

This is the front page for all Llama Scope SAEs. See the checkpoint links below.

Naming Convention

L[Layer][Position]-[Expansion]x

For instance, an SAE with 8x the hidden size of Llama-3.1-8B, i.e. 32K features, trained on the 15th post-MLP residual stream is called L15R-8x.
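
The naming convention can be checked mechanically. The following is a minimal sketch, not part of the Llama Scope tooling: the helper `parse_sae_name` is hypothetical, the hidden size of 4096 is assumed for Llama-3.1-8B, and the site descriptions (R, A, M, TC) are our reading of the position codes used in the overview table.

```python
import re

# Assumption: Llama-3.1-8B has a hidden size of 4096, so 8x -> 32768 (32K)
# features and 32x -> 131072 (128K) features.
HIDDEN_SIZE = 4096

# Assumption: our reading of the position codes; not defined in this README.
SITES = {
    "R": "post-MLP residual stream",
    "A": "attention output",
    "M": "MLP output",
    "TC": "transcoder",
}

# L[Layer][Position]-[Expansion]x, e.g. "L15R-8x"
NAME_RE = re.compile(r"^L(\d+)(R|A|M|TC)-(\d+)x$")


def parse_sae_name(name):
    """Split an SAE name into layer, site, expansion factor and feature count."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid SAE name: {name!r}")
    layer = int(match.group(1))
    site = match.group(2)
    expansion = int(match.group(3))
    return {
        "layer": layer,
        "site": site,
        "expansion": expansion,
        "n_features": expansion * HIDDEN_SIZE,
    }


print(parse_sae_name("L15R-8x"))
# {'layer': 15, 'site': 'R', 'expansion': 8, 'n_features': 32768}
```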

Checkpoints

Llama-3.1-8B-LXR-8x

Llama-3.1-8B-LXA-8x

Llama-3.1-8B-LXM-8x

Llama-3.1-8B-LXTC-8x

Llama-3.1-8B-LXR-32x

Llama-3.1-8B-LXA-32x

Llama-3.1-8B-LXM-32x

Llama-3.1-8B-LXTC-32x
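
Each entry above stands for one group of SAEs, with `X` ranging over the layers. As a rough sketch of how the full suite of 256 names could be enumerated: the function `all_sae_names` is hypothetical, and the layer indexing (32 layers, numbered 0 to 31) is an assumption that does not affect the total count.

```python
from itertools import product

# Assumption: Llama-3.1-8B has 32 layers; 0-based indexing is a guess.
LAYERS = range(32)
SITES = ["R", "A", "M", "TC"]
EXPANSIONS = [8, 32]


def all_sae_names():
    """Enumerate every name of the form L[Layer][Position]-[Expansion]x."""
    return [
        f"L{layer}{site}-{expansion}x"
        for expansion, site, layer in product(EXPANSIONS, SITES, LAYERS)
    ]


names = all_sae_names()
print(len(names))  # 256 = 32 layers * 4 sites * 2 expansion factors
print(names[0], names[-1])  # L0R-8x L31TC-32x
```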

Llama Scope SAE Overview

| | Llama Scope | Scaling Monosemanticity | GPT-4 SAE | Gemma Scope |
|---|---|---|---|---|
| Models | Llama-3.1 8B (Open Source) | Claude-3.0 Sonnet (Proprietary) | GPT-4 (Proprietary) | Gemma-2 2B & 9B (Open Source) |
| SAE Training Data | SlimPajama | Proprietary | Proprietary | Proprietary, Sampled from Mesnard et al. (2024) |
| SAE Position (Layer) | Every Layer | The Middle Layer | 5/6 Late Layer | Every Layer |
| SAE Position (Site) | R, A, M, TC | R | R | R, A, M, TC |
| SAE Width (# Features) | 32K, 128K | 1M, 4M, 34M | 128K, 1M, 16M | 16K, 64K, 128K, 256K - 1M (Partial) |
| SAE Width (Expansion Factor) | 8x, 32x | Proprietary | Proprietary | 4.6x, 7.1x, 28.5x, 36.6x |
| Activation Function | TopK-ReLU | ReLU | TopK-ReLU | JumpReLU |
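
For readers unfamiliar with the TopK-ReLU entry in the last row, here is a minimal pure-Python sketch of a generic TopK activation: it keeps the `k` largest pre-activations, applies ReLU to them, and zeroes the rest. This is an illustration only; it is not the improved TopK variant described in the technical report, and the function name `topk_relu` is ours.

```python
def topk_relu(pre_acts, k):
    """Keep the k largest pre-activations (passed through ReLU); zero the rest."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    # Indices sorted by value, largest first; ties keep their original order.
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [max(x, 0.0) if i in keep else 0.0 for i, x in enumerate(pre_acts)]


print(topk_relu([0.5, -1.0, 2.0, 0.1, 1.5], k=2))
# [0.0, 0.0, 2.0, 0.0, 1.5]
```

With `k` fixed, every input activates at most `k` features, so sparsity is set directly instead of through a penalty term.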

Citation

Please cite as:

@article{he2024llamascope,
  title={Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders},
  author={He, Zhengfu and Shu, Wentao and Ge, Xuyang and Chen, Lingjie and Wang, Junxuan and Zhou, Yunhua and Liu, Frances and Guo, Qipeng and Huang, Xuanjing and Wu, Zuxuan and others},
  journal={arXiv preprint arXiv:2410.20526},
  year={2024}
}