Abstract
This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, and BLOOM. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, owing to the consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance on benchmarks such as Tiny Stories and SuperGLUE while also successfully decreasing the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
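For intuition, here is a minimal sketch of how such a layer-to-layer linearity score can be estimated; the normalization details and function name are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch: how linear is the map from layer l's hidden states
# to layer l+1's? (Not the authors' code; normalization details assumed.)
import torch

def linearity_score(X: torch.Tensor, Y: torch.Tensor) -> float:
    """X, Y: [n_tokens, hidden_dim] hidden states of two consecutive layers."""
    # Center and scale so the score is comparable across layers.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    X = X / X.norm()
    Y = Y / Y.norm()
    # Best linear map A minimizing ||X A - Y||_F (least squares via pseudoinverse).
    A = torch.linalg.pinv(X) @ Y
    residual = (X @ A - Y).norm() ** 2
    # 1.0 means the next layer is an exactly linear function of the previous one.
    return float(1.0 - residual)
```

Per-layer hidden states can be collected with `model(..., output_hidden_states=True)` in Hugging Face Transformers.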
Community
The proposed regularization technique makes training more efficient by providing control over embedding linearisation.
Absolute chads at SberAI for still releasing after the war started, regardless of one's political stance, I respect them a lot for not just cancelling their research division or not putting anything on arxiv anymore.
However, the major work was done at AIRI. We love science and there are no limits for the job you love. Thank you for the kind words.
Very cool! Summary here (feedback welcome!): https://www.aimodels.fyi/papers/arxiv/your-transformer-is-secretly-linear
Title inspired by this one? ;) https://www.aimodels.fyi/papers/arxiv/from-words-to-numbers-your-large-language
I'm a simple man, I see "secretly linear," I upvote.
Well, from the newer paper by MIT it seems the features are not as linear as previously thought: https://huggingface.co/papers/2405.14860
In this case, if I understand both papers correctly, linearization can hurt the model by eliminating complex associations, such as days of the week, months, years, and many other implicit nonlinear features we may not even know exist in the model, but which are directly tied to the model's understanding of the cyclic/curved/jagged parts of the world.
These are different papers: this one studies the linearity of the transformation between two consecutive transformer blocks, whereas the MIT paper studies embedding linearity within a single transformer layer.
MIT VS AIRI LMAO
Is that so? Or should I say: We will see about that!
Working on reproducing this and similar pruning criteria here:
https://github.com/melisa-writer/short-transformers
Linear approximation of the last token is there, along with angular distances, the BI score, etc.
The goal of the library: choose your distance (layer importance metric), get a cropped model. :rocket:
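For readers wondering what such a layer-importance metric looks like in practice, here is a rough sketch of the general idea (not the short-transformers API): score each block by how much it changes the residual stream, and crop the blocks that change it least.

```python
# Rough sketch of a distance-based layer-importance metric
# (general idea only, not the short-transformers API).
import torch
import torch.nn.functional as F

def layer_importance(hidden_states: list[torch.Tensor]) -> list[float]:
    """hidden_states: per-layer tensors of shape [n_tokens, hidden_dim],
    e.g. collected with output_hidden_states=True."""
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(h_in, h_out, dim=-1)
        # Angular distance in [0, 1]; a small value means the block
        # barely rotates the residual stream and is a pruning candidate.
        scores.append(float((torch.arccos(cos.clamp(-1.0, 1.0)) / torch.pi).mean()))
    return scores
```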
The implications of this work are significant. There is so much to explore.
One thing that I can't quite grasp is how Cosine Similarity regularization manages to control linearity.
Actually, this is a challenging outcome: the hypothesis is that adding a cosine-similarity term that pushes embeddings to be more similar (CS -> 1) leads the training process to increase the nonlinear part of the residual stream. We plan to investigate this effect more.
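To make the discussion concrete, here is a minimal sketch of what a cosine-similarity regularizer over consecutive layers' hidden states could look like; the function and coefficient names are assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a cosine-similarity regularizer between consecutive
# layers (names and coefficient are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def cosine_regularizer(hidden_states: list[torch.Tensor]) -> torch.Tensor:
    """hidden_states: per-layer tensors of shape [batch * seq_len, hidden_dim]."""
    penalty = hidden_states[0].new_zeros(())
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(h_prev, h_next, dim=-1)
        penalty = penalty + (1.0 - cos).mean()  # zero when CS == 1
    return penalty / (len(hidden_states) - 1)

# total_loss = lm_loss + reg_coeff * cosine_regularizer(hidden_states)
```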
Your Transformer Might Be Linear! | Deep Dive
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/
In the paper:
"Furthermore, our feature triggering regime hypothesis proposes that rare specific features on a few tokens with high non-linearity significantly influence model behavior; in Figure 9 one can see that some layers of OPT-1.3B have a long-tailed distribution of L2 errors, which means that there are still sparse spikes of non-linearity."
How is this L2 error in Figure 9 calculated?
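Here is my guess at how such a per-token L2 error could be computed, assuming it is the residual of the best linear fit between consecutive layers' normalized hidden states (the paper may define it differently):

```python
# Guess at a per-token L2 error of the linear approximation between two
# consecutive layers (assumed definition, may differ from the paper).
import torch

def per_token_l2_error(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """X, Y: [n_tokens, hidden_dim] hidden states of layers l and l+1."""
    X = X / X.norm(dim=-1, keepdim=True)
    Y = Y / Y.norm(dim=-1, keepdim=True)
    A = torch.linalg.pinv(X) @ Y           # least-squares linear map X -> Y
    return (X @ A - Y).norm(dim=-1)        # one L2 error per token

# A long-tailed histogram of these values would show the sparse
# non-linearity spikes mentioned in the quoted passage.
```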
o ty! :D