# LongVA
<p align="center">
    <img src="vision_niah/niah_output/LongVA-7B/heatmap.png" width="800">
</p>

<p align="center">
    ๐ŸŒ <a href="https://lmms-lab.github.io/posts/longva/" target="_blank">Blog</a> | ๐Ÿ“ƒ <a href="https://arxiv.org/abs/2406.16852" target="_blank">Paper</a> | ๐Ÿค— <a href="https://huggingface.co/collections/lmms-lab/longva-667538e09329dbc7ea498057" target="_blank">Hugging Face</a> | ๐ŸŽฅ <a href="https://longva-demo.lmms-lab.com/" target="_blank">Demo</a>
</p>

Long-context capability can **zero-shot transfer** from language to vision.

LongVA can process **2,000** frames or over **200K** visual tokens, and achieves **state-of-the-art** performance on Video-MME among 7B models.