update readme
README.md
CHANGED
@@ -1,3 +1,28 @@
----
-license: apache-2.0
----
+---
+license: apache-2.0
+---
+
+
+# Leopard-Idefics2
+
+[Paper](https://arxiv.org/abs/2410.01744) | [GitHub](https://github.com/tencent-ailab/Leopard) | [Models-LLaVA](https://huggingface.co/wyu1/Leopard-LLaVA) | [Models-Idefics2](https://huggingface.co/wyu1/Leopard-Idefics2)
+
+## Summary
+
+Text-rich images, in which text is the central visual element guiding overall understanding, are prevalent in real-world applications such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging because they require not only understanding the content of individual images but also reasoning about the inter-relationships and logical flow across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle with such tasks because of two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution against visual feature sequence length.
+To address these challenges, we propose Leopard, an MLLM designed specifically for vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios
+and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities in text-rich, multi-image evaluations and competitive performance in general-domain evaluations.
+
+## Architecture
+
+For Leopard-Idefics2, we follow the architecture of Idefics2-8B, which uses SigLIP-SO-400M as the visual encoder but increases its image resolution to 980×980 to keep text legible. A feature resampler compresses the visual encoder's output features into 64 tokens per image. Idefics2-8B adopts Mistral-7B as the language model.
+
+## Citation
+```
+@article{jia2024leopard,
+title={LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks},
+author={Jia, Mengzhao and Yu, Wenhao and Ma, Kaixin and Fang, Tianqing and Zhang, Zhihan and Ouyang, Siru and Zhang, Hongming and Jiang, Meng and Yu, Dong},
+journal={arXiv preprint arXiv:2410.01744},
+year={2024}
+}
+```
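The architecture paragraph added to the README gives two concrete numbers: a 980×980 input resolution and 64 resampled tokens per image. The sketch below only works through the arithmetic those numbers imply. The patch size of 14 is an assumption taken from the published SigLIP-SO400M configuration, not from this README, and the function names are hypothetical. It also ignores Leopard's adaptive multi-image encoding, which may change the per-image token count.

```python
# Back-of-the-envelope token arithmetic for the base Idefics2 setup
# described in the README. Not part of the Leopard codebase.

IMAGE_SIDE = 980        # pixels per side, from the README
TOKENS_PER_IMAGE = 64   # resampler output tokens per image, from the README
PATCH_SIDE = 14         # ASSUMPTION: SigLIP-SO400M patch size, not stated in the README


def patches_per_image(image_side: int = IMAGE_SIDE, patch_side: int = PATCH_SIDE) -> int:
    """Number of square patches the vision encoder sees for one image."""
    per_side = image_side // patch_side
    return per_side * per_side


def visual_sequence_length(num_images: int, tokens_per_image: int = TOKENS_PER_IMAGE) -> int:
    """Visual tokens handed to the language model for num_images images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images * tokens_per_image


def compression_ratio(tokens_per_image: int = TOKENS_PER_IMAGE) -> float:
    """How many encoder patches are folded into each resampled token."""
    return patches_per_image() / tokens_per_image


if __name__ == "__main__":
    print("patches per image:", patches_per_image())
    print("visual tokens for 4 images:", visual_sequence_length(4))
    print("patches per resampled token:", compression_ratio())
```

Under these assumptions, each image contributes a fixed 64 tokens regardless of how much text it contains, which illustrates the trade-off between resolution and sequence length that the summary raises.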