World Model on Million-Length Video And Language With RingAttention
Abstract
Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: we train one of the largest-context transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long-sequence chat. (c) A highly optimized implementation with RingAttention, masked sequence packing, and other key features for training on million-length multimodal sequences. (d) A fully open-sourced family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.
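As a loose illustration of what contribution (b)'s "masked sequence packing" can look like in practice (my own sketch and naming, not the paper's exact recipe): several documents of different lengths are packed into one fixed-length training sequence, and an attention mask keeps tokens from attending across document boundaries or to padding.

```python
# Loose sketch of masked sequence packing (an assumption, not the paper's exact
# recipe): pack several documents into one fixed-length sequence and build a
# mask so tokens never attend across document boundaries or to padding.

import jax.numpy as jnp

def pack_with_segments(docs, seq_len, pad_id=0):
    """docs: list of 1-D int32 token arrays. Returns (tokens, segment_ids)."""
    tokens = jnp.full((seq_len,), pad_id, dtype=jnp.int32)
    segment_ids = jnp.zeros((seq_len,), dtype=jnp.int32)  # 0 marks padding
    pos = 0
    for seg, doc in enumerate(docs, start=1):
        n = min(len(doc), seq_len - pos)
        tokens = tokens.at[pos:pos + n].set(doc[:n])
        segment_ids = segment_ids.at[pos:pos + n].set(seg)
        pos += n
    return tokens, segment_ids

def packed_attention_mask(segment_ids):
    """Causal mask that also blocks attention across documents and to padding."""
    s = segment_ids.shape[0]
    causal = jnp.tril(jnp.ones((s, s), dtype=bool))
    same_doc = segment_ids[:, None] == segment_ids[None, :]
    key_not_pad = (segment_ids != 0)[None, :]
    # Let every position see itself so padding rows do not yield NaN softmaxes;
    # padding tokens are dropped from the loss anyway.
    return (causal & same_doc & key_not_pad) | jnp.eye(s, dtype=bool)

tokens, seg = pack_with_segments(
    [jnp.array([5, 6, 7], dtype=jnp.int32), jnp.array([8, 9], dtype=jnp.int32)],
    seq_len=8)
mask = packed_attention_mask(seg)  # shape (8, 8), dtype bool
loss_mask = seg != 0               # ignore padding positions in the loss
```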
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VideoPoet: A Large Language Model for Zero-Shot Video Generation (2023)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (2024)
- Generative Multimodal Models are In-Context Learners (2023)
- Text-Conditioned Resampler For Long Form Video Understanding (2023)
- Distilling Vision-Language Models on Millions of Videos (2024)
It's wild that Gemini 1.5 is receiving all the praise for its long-context capabilities when you guys achieved the same thing, arguably more efficiently, with dense models and slightly better performance. So flowers to you guys... GREAT WORK!
Hi folks, I'm trying to understand Ring Attention. Is my reading of this approach correct?
- Start with N devices connected together in a ring topology
- Each of the N devices is assigned a sequence block and is responsible for computing that block's output end-to-end for some (all?) layers, i.e. this is a form of sequence parallelism
- I'm guessing there's still standard pipeline parallelism? E.g. we still have different groups/pods of devices assigned to different groups of layers?
- For each sequence block, concurrently calculate the partial softmaxed attention scores (which requires cycling through each set of kv-blocks)
- At each inner-round (to cycle through the kv-blocks), calculate + accumulate the partial attention scores for the current kv-block we hold (GEMM, compute bound)
- Simultaneously, send/recv (via triple buffering?) the kv-blocks for the next round.
- As soon as the final attention is available, start the blockwise FFN
There's an optimal chunk size that varies with the link bandwidth to achieve communication-computation overlap (assuming a fixed device profile for GEMM compute throughput), with the intuition that slower links require larger minimum chunk sizes.
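Back-of-envelope (my own numbers, ignoring the FFN and softmax bookkeeping, with $c$ the chunk length, $d$ the head dim, $w$ bytes per element, $F$ the device FLOP/s, and $B$ the link bandwidth in bytes/s):

$$T_{\text{comp}} \approx \frac{4c^2 d}{F} \;\; (QK^T \text{ and } PV \text{ GEMMs}), \qquad T_{\text{comm}} \approx \frac{2cdw}{B} \;\; (\text{next } K, V \text{ blocks}),$$

so full overlap requires $T_{\text{comp}} \ge T_{\text{comm}}$, i.e. $c \gtrsim \frac{Fw}{2B}$: the minimum chunk length scales with the compute-to-bandwidth ratio, matching the intuition that slower links need larger chunks.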
In particular, this is different from "brute-force" sequence parallelism prior to Ring Attention in that it doesn't rely on naive Scatter/All-Gather schemes, which:
- Impose a forced synchronization point (for K) before the $QK^T$ GEMM (to be fair, RingAttention still needs each k,v block to arrive before using it, but it can incrementally advance the partial attention without waiting for a full All-Gather to complete, making it possible to fully overlap communication), and
- May cause the per-device attention memory usage to grow without bound as the sequence length increases
Is that the right idea, i.e. this is how one fully overlaps communication and compute through a progressive blockwise attention scheme while keeping the per-device memory usage bounded? (I've tried to sketch my reading in code below.)
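For concreteness, here is a minimal single-host sketch of the blockwise accumulation I have in mind (my own toy code, not the authors' implementation; `blockwise_attention` is a made-up name, and in a real ring the loop body would run on each device with the next K/V block exchanged via something like `jax.lax.ppermute` while the current GEMMs execute):

```python
# Single-host sketch of blockwise (ring-style) attention accumulation in JAX.
# Not the paper's implementation: it only shows how one query block can fold in
# one KV block at a time via an online softmax, which is what lets the next KV
# block be fetched from the neighboring device while the current GEMMs run.
# No causal mask, for brevity.

import jax
import jax.numpy as jnp

def blockwise_attention(q_block, k_blocks, v_blocks):
    """Attention of one query block against a sequence of KV blocks.

    q_block:  [c, d]     query block held by this device
    k_blocks: [n, c, d]  KV blocks, visited one per ring step
    v_blocks: [n, c, d]
    """
    c, d = q_block.shape
    scale = 1.0 / jnp.sqrt(d)

    # Running statistics for the online softmax: un-normalized output,
    # running row-wise max, and running normalizer.
    out = jnp.zeros((c, d))
    row_max = jnp.full((c,), -jnp.inf)
    denom = jnp.zeros((c,))

    def step(carry, kv):
        out, row_max, denom = carry
        k_blk, v_blk = kv
        # Partial attention scores for the KV block currently held (GEMM).
        scores = (q_block @ k_blk.T) * scale             # [c, c]
        new_max = jnp.maximum(row_max, scores.max(axis=-1))
        correction = jnp.exp(row_max - new_max)          # rescale old stats
        p = jnp.exp(scores - new_max[:, None])           # [c, c]
        out = out * correction[:, None] + p @ v_blk
        denom = denom * correction + p.sum(axis=-1)
        # In RingAttention, this is where the next K/V block would already be
        # arriving from the neighboring device (send/recv overlapped with the
        # GEMMs above); here we simply scan over blocks sequentially.
        return (out, new_max, denom), None

    (out, _, denom), _ = jax.lax.scan(step, (out, row_max, denom),
                                      (k_blocks, v_blocks))
    return out / denom[:, None]

# Quick check against full (non-blockwise) attention.
n, c, d = 4, 8, 16
q = jax.random.normal(jax.random.PRNGKey(0), (n, c, d))
k = jax.random.normal(jax.random.PRNGKey(1), (n, c, d))
v = jax.random.normal(jax.random.PRNGKey(2), (n, c, d))

ring_out = blockwise_attention(q[0], k, v)
full = jax.nn.softmax((q[0] @ k.reshape(-1, d).T) / jnp.sqrt(d)) @ v.reshape(-1, d)
assert jnp.allclose(ring_out, full, atol=1e-4)
```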
Additionally, how are the RoPE base frequencies tuned? Do they follow some specific scaling recipe (e.g. https://huggingface.co/papers/2310.05209)?