@beomi on Hugging Face: "🚀 **InfiniTransformer, Gemma/Llama3 based Implementation!** 🌌

Hugging Face

Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Back to feed

beomi

posted an update Apr 18

Post

12233

🚀 **InfiniTransformer, Gemma/Llama3 based Implementation!** 🌌

> Update @ 2024.04.19: It now supports Llama-3!

> Note: this implementation is unofficial

This implementation is designed to handle virtually infinite context lengths.

Here's the github repo: https://github.com/Beomi/InfiniTransformer

📄 **Read the original Paper:** https://arxiv.org/abs/2404.07143

## **Focus on Infini-Attention**

- **2 Types of Implementation available:** Attention-layer only implementation / Model & Train-wise implementation
- **Fixed(segment dependent) Memory Usage:** Enables training on larger models and longer sequences without the memory overhead typical of standard Transformer implementations.
- **Infinite Context Capability:** Train with unprecedented sequence lengths—imagine handling up to 1 million sequence lengths on standard hardware!
- You could train Gemma-2B with 1M sequence length with 2K segmentation size with single H100 GPU.

## **Try InfiniTransformer**

1. **Clone the repository:**

bash
   git clone https://github.com/Beomi/InfiniTransformer

2. **Install necessary tools:**

bash
   pip install -r requirements.txt
   pip install -e git+https://github.com/huggingface/transformers.git@b109257f4f#egg=transformers

3. **Dive Deep into Custom Training:**
- Train with extensive sequence lengths using scripts such as ./train.gemma.infini.noclm.1Mseq.sh.

for more detailed info, please visit Repo: https://github.com/Beomi/InfiniTransformer

Look forward to see your feedbacks! 😊

ps. Training loss plot is here 😉

tomaarsen

Apr 18

Nice job! What are your findings so far? Can you reasonably handle the lengths that they claim?

beomi

Apr 18

I'm testing on it, with 32K(claimed at paper) and 1M seq len.

I'm training those models with minipile dataset and for now, it seems minimal continual training let model to adapt 'memory' could be sufficient.(less than <1B tokens)

Train is not finished yet, but after the loss converges then I could test haystack test or inference tests. it won't be take long :)

In this post