Cure the headache of Transformers via Collinear Constrained Attention
Abstract
As practical applications based on Large Language Models continue to progress rapidly, the importance of extrapolation performance has grown exponentially in the research domain. In our study, we identified an anomalous behavior in Transformer models that had previously been overlooked, leading to chaos around the closest tokens, which carry the most important information. We have coined this discovery the "headache of Transformers". To address it at its core, we introduce a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation and interpolation methods and other optimization strategies designed for traditional Transformer models. We achieve excellent extrapolation performance even at 16 to 24 times the training sequence length during inference, without any fine-tuning of our model. We have also enhanced CoCA's computational and spatial efficiency to ensure its practicality. We plan to open-source CoCA shortly; in the meantime, our code is available in the appendix for reproducing the experiments.
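The abstract does not spell out the construction, so the following is only a minimal sketch of one way to impose a collinear constraint between queries and keys inside each 2-D rotary sub-plane before applying RoPE, so that every query/key pair starts from a zero initial angle. The projection names, the softplus used to keep the per-plane key scales non-negative, and the naive broadcasted score computation are assumptions of this sketch, not the paper's exact formulation.

```python
# Minimal single-head sketch of collinear-constrained attention scores with RoPE.
# NOT the paper's exact formulation: the projections, the softplus that keeps the
# per-plane key scales non-negative, and the naive broadcasted score computation
# are simplifying assumptions of this sketch.
import torch
import torch.nn.functional as F


def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE rotation angles, one per 2-D sub-plane: shape (seq, dim // 2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None].float() * inv_freq[None, :]


def coca_scores(x: torch.Tensor, w_q: torch.Tensor, w_t: torch.Tensor) -> torch.Tensor:
    """
    x:   (seq, d_model) token representations
    w_q: (d_model, d_head) query projection
    w_t: (d_model, d_head // 2) projection giving one non-negative scale per rotary plane
    Returns the (seq, seq) pre-softmax attention scores.
    """
    seq, _ = x.shape
    d_head = w_q.shape[1]
    q = x @ w_q                                       # (seq, d_head)
    t = F.softplus(x @ w_t)                           # (seq, d_head // 2), scales >= 0
    theta = rope_angles(torch.arange(seq), d_head)    # (seq, d_head // 2)

    # Collinear constraint: in every rotary plane i, the key scored against query
    # position m points in the same direction as q_m, scaled by t_n[i] >= 0.
    # Rotating q_m by theta_m and that key by theta_n, the plane's dot product is
    #   t_n[i] * |q_m in plane i|^2 * cos(theta_m[i] - theta_n[i]).
    q1, q2 = q[..., 0::2], q[..., 1::2]               # 2-D components, (seq, d_head // 2)
    q_norm2 = q1 ** 2 + q2 ** 2
    rel = theta[:, None, :] - theta[None, :, :]       # (seq, seq, d_head // 2)
    scores = (q_norm2[:, None, :] * t[None, :, :] * torch.cos(rel)).sum(-1)
    return scores / d_head ** 0.5                     # (seq, seq)


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(8, 32)
    print(coca_scores(x, torch.randn(32, 16), torch.randn(32, 8)).shape)  # torch.Size([8, 8])
```

Because cos(θ_m − θ_n) expands via the angle-difference identity, the broadcasted (n, n, d/2) tensor above can be replaced by two ordinary matrix products; whether that matches the computational and spatial efficiency enhancements mentioned in the abstract is not something the abstract specifies.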
Community
Can anyone explain the Rotary Borders phenomenon further? In particular, even if it is a technical deficiency of RoPE, why do the authors believe addressing it will address the sequence-length generalizability problem?
Additionally, are there any standard benchmarks that this is run on? Table 6, for example, looks more like incoherence than creative output.
@leegao19 Thank you for your interest in our work; I am the author of this paper. We have an updated version with more experiments here: https://arxiv.org/abs/2309.08646. Let me answer the questions you are curious about.
- Why do we believe that addressing the anomalous behavior of RoPE will address the sequence-length generalizability problem? From a theoretical perspective, this phenomenon disrupts a prior hypothesis of language models: for two specific tokens, the attention score should be higher at close range than at far range, absent any contextual condition, as shown in section [ANOMALOUS BEHAVIOR BETWEEN ROPE AND ATTENTION MATRICES]. On the other hand, from an empirical perspective, our experiments confirmed this anomaly. (A small numerical illustration of the anomaly follows after this list.)
- Are there any standard benchmarks that this is run on? We will post the next version of this work in February, where more benchmarks will be examined; feel free to keep following it.
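To make the anomaly concrete, here is a toy numerical sketch (not taken from the paper; the dimension, scale, and seed are arbitrary choices): with RoPE, the pre-softmax score of a fixed query/key pair need not peak at relative distance 0, whereas it always does once the key is a non-negative multiple of the query.

```python
# Toy illustration (not from the paper): with RoPE, the pre-softmax score of a
# fixed query/key pair need not peak at relative distance 0, but it always does
# when the key is a non-negative multiple of the query (the collinear case).
# Dimension, scale, and random seed are arbitrary choices of this sketch.
import torch


def rope_score(q: torch.Tensor, k: torch.Tensor, rel_dist: int, base: float = 10000.0) -> float:
    """Dot product of q and k after RoPE places them rel_dist positions apart."""
    d = q.shape[0]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = rel_dist * inv_freq                 # one angle per 2-D rotary plane
    q1, q2 = q[0::2], q[1::2]
    k1, k2 = k[0::2], k[1::2]
    # Rotations preserve norms, so only the relative angle rel_dist * inv_freq matters.
    score = (q1 * k1 + q2 * k2) * torch.cos(ang) + (q1 * k2 - q2 * k1) * torch.sin(ang)
    return score.sum().item()


torch.manual_seed(0)
q, k = torch.randn(64), torch.randn(64)
random_scores = [rope_score(q, k, m) for m in range(64)]
collinear_scores = [rope_score(q, 0.5 * q, m) for m in range(64)]
print("peak distance, random pair:   ", int(torch.tensor(random_scores).argmax()))
print("peak distance, collinear pair:", int(torch.tensor(collinear_scores).argmax()))  # always 0
```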
Thanks, I'm looking forward to seeing the updates next month!
> For two specific tokens, the attention score should be higher at close range than at far range, absent any contextual condition, as shown in section [ANOMALOUS BEHAVIOR BETWEEN ROPE AND ATTENTION MATRICES].
While the idea that order-inversion degrades self-attention performance makes sense, it's still unclear why this causes, e.g.
- catastrophic attention degradation during extrapolation (relative distance > maximum trained sequence)
- why is there a difference between the performance of interpolation (e.g. PI) vs extrapolation (see the sketch after this comment)
Could it be that collinearity fixes (forces) generalizability by addressing some other issue? It doesn't really seem like fixing order-inversion would improve extrapolatability (in the o.o.d regime), but rather the performance of self-attention within the i.i.d regime.
At the same time, from an interpretability perspective, it seems like enforcing collinearity between K and Q before positional encoding may decrease their expressiveness, and may lead to potential problems while learning representations to generalize. E.g. it learns how to generalize positional encoding better (learns to use the rotational invariance), but does it give up other crucial capabilities in order to do so?
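For reference, the interpolation-vs-extrapolation distinction in the second bullet can be made concrete with a generic sketch (standard RoPE/Position Interpolation behaviour, not specific to this paper; the 2048/8192 lengths are arbitrary example values): extrapolation feeds RoPE position indices beyond the trained range, while PI rescales them so they stay inside it.

```python
# Generic sketch of extrapolation vs. Position Interpolation (PI) for RoPE
# position indices; lengths are arbitrary example values, not from the paper.
import torch


def rope_positions(seq_len: int, train_len: int, mode: str) -> torch.Tensor:
    """Position indices handed to RoPE when the sequence is longer than training."""
    pos = torch.arange(seq_len, dtype=torch.float32)
    if mode == "extrapolation":
        return pos                          # indices > train_len are out-of-distribution
    if mode == "interpolation":
        return pos * (train_len / seq_len)  # PI: squeeze indices into the trained range
    raise ValueError(mode)


train_len, test_len = 2048, 8192
print(rope_positions(test_len, train_len, "extrapolation")[-1])  # tensor(8191.)
print(rope_positions(test_len, train_len, "interpolation")[-1])  # tensor(2047.7500)
```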
> While the idea that order-inversion degrades self-attention performance makes sense, it's still unclear why this causes, e.g.
> - catastrophic attention degradation during extrapolation (relative distance > maximum trained sequence)
> - why is there a difference between the performance of interpolation (e.g. PI) vs extrapolation
>
> Could it be that collinearity fixes (forces) generalizability by addressing some other issue? It doesn't really seem like fixing order-inversion would improve extrapolatability (in the o.o.d regime), but rather the performance of self-attention within the i.i.d regime.
Your insight is correct: in the next version of our work, we actually found some causes derived from the Collinear Constraint that significantly affect extrapolation capability, which may further answer your question. However, the underlying mechanism of attention is still not fully understood, and we will propose a possible conjecture later.
> At the same time, from an interpretability perspective, it seems like enforcing collinearity between K and Q before positional encoding may decrease their expressiveness, and may lead to potential problems while learning representations to generalize. E.g. it learns how to generalize positional encoding better (learns to use the rotational invariance), but does it give up other crucial capabilities in order to do so?
Yes, we are curious about this as well, and we examined CoCA's expressiveness in the next version; it may further answer your question. We look forward to sharing the update.