Interpretability
Select papers on language model interpretability with notes
Analyzing Transformers in Embedding Space
Paper • 2209.02535 • Published • 3
Note: Works on the model weights alone (without any input), projecting the layers and heads into the embedding space and decoding them as tokens. See the appendix to convince yourself. Also indicates that models trained from the same frozen embeddings become nearly interchangeable.
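A minimal sketch of the projection idea, assuming a GPT-2 checkpoint from Hugging Face transformers; the layer and row indices are arbitrary illustrations, not choices from the paper:

```python
# Sketch: interpret transformer weights in embedding space (no input needed).
# Assumes GPT-2 via Hugging Face transformers; layer/row choices are arbitrary examples.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

E = model.get_output_embeddings().weight            # (vocab, d_model), tied unembedding matrix
mlp_out = model.transformer.h[8].mlp.c_proj.weight  # (d_ff, d_model) in GPT-2's Conv1D layout

# Each row of the MLP output projection writes a direction into the residual stream;
# project it onto the vocabulary and inspect the top tokens it promotes.
value_vec = mlp_out[42]                              # one "value" vector (row 42, arbitrary)
logits = E @ value_vec                               # (vocab,)
top = torch.topk(logits, k=10).indices
print([tok.decode([int(i)]) for i in top])
```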
Locating and Editing Factual Associations in GPT
Paper • 2202.05262 • Published • 1
Note: Great overview: https://youtu.be/_NMQyOu2HTo. Factual knowledge lies in the feed-forward weights; some knowledge can be edited at a single neuron.
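The editing idea boils down to a rank-one update of one feed-forward projection matrix. The sketch below is only a toy illustration of such an update, not the paper's full ROME procedure; the key and value vectors are random placeholders rather than the optimized ones ROME solves for:

```python
# Toy sketch of a rank-one edit to a feed-forward weight, in the spirit of ROME.
# NOTE: the real method solves for the key/value vectors; here they are random placeholders.
import torch

d_ff, d_model = 3072, 768
W = torch.randn(d_model, d_ff) * 0.02   # stand-in for an MLP down-projection

k = torch.randn(d_ff)                   # "key": activation pattern that selects the fact
v_new = torch.randn(d_model)            # "value": residual-stream direction for the new fact

# Rank-one update so that W_edited @ k == v_new while changing W as little as possible
# (ROME adds a covariance-weighted constraint; omitted here for brevity).
delta = (v_new - W @ k).outer(k) / (k @ k)
W_edited = W + delta

print((W_edited @ k - v_new).abs().max())  # ~0: the edited weight now maps k -> v_new
```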
A Multiscale Visualization of Attention in the Transformer Model
Paper • 1906.05714 • Published • 2
Note: https://github.com/jessevig/bertviz lets you visualize attention per head and per layer.
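A minimal usage sketch, assuming bertviz is installed and run in a Jupyter notebook; the model name and sentence are just examples:

```python
# Sketch: visualize per-layer, per-head attention with bertviz (run in a notebook).
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view

name = "bert-base-uncased"                      # example model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)           # interactive view: layers x heads
```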
BERT Rediscovers the Classical NLP Pipeline
Paper • 1905.05950 • Published • 2
Note: Different layers of a transformer perform better on different NLP tasks such as POS tagging, NER, coreference, etc. They use the intermediate representations as inputs to simple classifiers (probes).
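A rough sketch of that probing setup, assuming Hugging Face transformers and scikit-learn; the toy sentences, tags, and the probed layer are placeholders, not the paper's data or edge-probing tasks:

```python
# Sketch: probe one hidden layer with a simple classifier (POS-tagging-style probe).
# Dataset, labels, and the chosen layer are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

sentences = ["The dog barked .", "She reads books ."]   # toy data
labels    = [["DET", "NOUN", "VERB", "PUNCT"],          # toy per-word tags
             ["PRON", "VERB", "NOUN", "PUNCT"]]

layer = 6                                               # probe one intermediate layer
X, y = [], []
for sent, tags in zip(sentences, labels):
    enc = tokenizer(sent.split(), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]   # (seq_len, d_model)
    for pos, wid in enumerate(enc.word_ids()):
        if wid is not None:                             # skip special tokens; subwords get the word's tag
            X.append(hidden[pos].numpy())
            y.append(tags[wid])

probe = LogisticRegression(max_iter=1000).fit(X, y)     # the "simple classifier"
print(probe.score(X, y))                                # train accuracy on the toy data
```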
Prompt-to-Prompt Image Editing with Cross Attention Control
Paper • 2208.01626 • Published • 2
Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change
Paper • 1605.09096 • Published