TRAMS: Training-free Memory Selection for Long-range Language Modeling
Abstract
The Transformer architecture is central to many AI models, but it still faces challenges in long-range language modeling. Although several transformer variants have been designed to handle long-range dependencies, existing methods such as Transformer-XL are plagued by a high proportion of ineffective memories. In this study, we present a plug-and-play strategy, TRAining-free Memory Selection (TRAMS), that selects the tokens participating in the attention calculation based on one simple metric. This strategy lets us keep the memory tokens that are likely to receive a high attention score from the current queries and ignore the rest. We evaluate our approach on a word-level benchmark (WikiText-103) and a character-level benchmark (enwik8), and the results show an improvement without any additional training or added parameters.
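To make the idea concrete, here is a minimal, hypothetical sketch of training-free memory selection during attention. It assumes the "simple metric" is a cheap, query-independent score on each cached key (here, its L2 norm) and a top-m selection budget; the function name `trams_attention`, the norm-based score, and the budget are illustrative assumptions, not the paper's exact formulation. Causal masking within the current segment is omitted for brevity.

```python
import torch
import torch.nn.functional as F


def trams_attention(q, mem_k, mem_v, local_k, local_v, m):
    """Attend over m selected memory tokens plus the local context.

    q:          (T, d)  current queries
    mem_k/v:    (M, d)  cached memory keys/values (M >> m)
    local_k/v:  (L, d)  keys/values of the current segment
    m:          number of memory tokens to keep (selection budget)
    """
    # Score each memory token with a cheap, training-free metric.
    # Assumption: a larger key norm suggests a larger attention score.
    scores = mem_k.norm(dim=-1)                      # (M,)
    keep = scores.topk(min(m, mem_k.size(0))).indices

    # Concatenate the selected memory with the local segment.
    k = torch.cat([mem_k[keep], local_k], dim=0)     # (m + L, d)
    v = torch.cat([mem_v[keep], local_v], dim=0)

    # Standard scaled dot-product attention over the reduced context.
    attn = F.softmax(q @ k.t() / q.size(-1) ** 0.5, dim=-1)
    return attn @ v                                   # (T, d)


if __name__ == "__main__":
    d, T, M, L = 64, 8, 512, 128
    out = trams_attention(
        torch.randn(T, d), torch.randn(M, d), torch.randn(M, d),
        torch.randn(L, d), torch.randn(L, d), m=64,
    )
    print(out.shape)  # torch.Size([8, 64])
```

Because the selection runs only at inference time and touches no weights, a sketch like this can wrap the attention of an existing Transformer-XL-style model without retraining or new parameters.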
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models (2023)
- Efficient Streaming Language Models with Attention Sinks (2023)
- Context Compression for Auto-regressive Transformers with Sentinel Tokens (2023)
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning (2023)
- Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention (2023)