File size: 7,873 Bytes
c6547f4 2528f95 c6547f4 5c0e426 c6547f4 2528f95 c6547f4 5c0e426 c6547f4 f4df4fa c6547f4 0c0a987 c6547f4 5c0e426 c6547f4 5c0e426 c6547f4 5c0e426 c6547f4 5c0e426 c6547f4 a493283 c6547f4 5c0e426 c6547f4 a493283 c6547f4 5c0e426 c6547f4 5c0e426 21f2ce1 5c0e426 c6547f4 5c0e426 c6547f4 5c0e426 c6547f4 5c0e426 c6547f4 5c0e426 fbea248 5c0e426 f4df4fa 5c0e426 28cc1fd 5c0e426 f4df4fa 28cc1fd 5c0e426 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 |
---
base_model: TIGER-Lab/Mantis-8B-siglip-llama3-pretraind
tags:
- multimodal
- lmm
- vlm
- llava
- siglip
- llama3
- mantis
model-index:
- name: Mantis-8B-siglip-llama3
results: []
license: llama3
datasets:
- TIGER-Lab/Mantis-Instruct
language:
- en
---
# 🔥 Mantis (TMLR 2024)
[Paper](https://arxiv.org/abs/2405.01483) |
[Website](https://tiger-ai-lab.github.io/Mantis/) |
[Github](https://github.com/TIGER-AI-Lab/Mantis) |
[Models](https://huggingface.co/collections/TIGER-Lab/mantis-6619b0834594c878cdb1d6e4) |
[Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis) |
[Wandb](https://api.wandb.ai/links/dongfu/iydlrqgj)
![Mantis](https://tiger-ai-lab.github.io/Mantis/images/radar_chart.png)
## Summary
- Mantis is an LLaMA-3 based LMM with **interleaved text and image as inputs**, train on Mantis-Instruct under academic-level resources (i.e. 36 hours on 16xA100-40G).
- Mantis is trained to have multi-image skills including co-reference, reasoning, comparing, temporal understanding.
- Mantis reaches the state-of-the-art performance on five multi-image benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval), and also maintain a strong single-image performance on par with CogVLM and Emu2.
## Multi-Image Performance
| Models | Size | Format | NLVR2 | Q-Bench | Mantis-Eval | BLINK | MVBench | Avg |
|--------------------|:----:|:--------:|:-----:|:-------:|:-----------:|:-----:|:-------:|:----:|
| GPT-4V | - | sequence | 88.80 | 76.52 | 62.67 | 51.14 | 43.50 | 64.5 |
| Open Source Models | | | | | | | | |
| Random | - | - | 48.93 | 40.20 | 23.04 | 38.09 | 27.30 | 35.5 |
| Kosmos2 | 1.6B | merge | 49.00 | 35.10 | 30.41 | 37.50 | 21.62 | 34.7 |
| LLaVA-v1.5 | 7B | merge | 53.88 | 49.32 | 31.34 | 37.13 | 36.00 | 41.5 |
| LLava-V1.6 | 7B | merge | 58.88 | 54.80 | 45.62 | 39.55 | 40.90 | 48.0 |
| Qwen-VL-Chat | 7B | merge | 58.72 | 45.90 | 39.17 | 31.17 | 42.15 | 43.4 |
| Fuyu | 8B | merge | 51.10 | 49.15 | 27.19 | 36.59 | 30.20 | 38.8 |
| BLIP-2 | 13B | merge | 59.42 | 51.20 | 49.77 | 39.45 | 31.40 | 46.2 |
| InstructBLIP | 13B | merge | 60.26 | 44.30 | 45.62 | 42.24 | 32.50 | 45.0 |
| CogVLM | 17B | merge | 58.58 | 53.20 | 45.16 | 41.54 | 37.30 | 47.2 |
| OpenFlamingo | 9B | sequence | 36.41 | 19.60 | 12.44 | 39.18 | 7.90 | 23.1 |
| Otter-Image | 9B | sequence | 49.15 | 17.50 | 14.29 | 36.26 | 15.30 | 26.5 |
| Idefics1 | 9B | sequence | 54.63 | 30.60 | 28.11 | 24.69 | 26.42 | 32.9 |
| VideoLLaVA | 7B | sequence | 56.48 | 45.70 | 35.94 | 38.92 | 44.30 | 44.3 |
| Emu2-Chat | 37B | sequence | 58.16 | 50.05 | 37.79 | 36.20 | 39.72 | 44.4 |
| Vila | 8B | sequence | 76.45 | 45.70 | 51.15 | 39.30 | 49.40 | 52.4 |
| Idefics2 | 8B | sequence | 86.87 | 57.00 | 48.85 | 45.18 | 29.68 | 53.5 |
| Mantis-CLIP | 8B | sequence | 84.66 | 66.00 | 55.76 | 47.06 | 48.30 | 60.4 |
| Mantis-SIGLIP | 8B | sequence | 87.43 | 69.90 | **59.45** | 46.35 | 50.15 | 62.7 |
| Mantis-Flamingo | 9B | sequence | 52.96 | 46.80 | 32.72 | 38.00 | 40.83 | 42.3 |
| Mantis-Idefics2 | 8B | sequence | **89.71** | **75.20** | 57.14 | **49.05** | **51.38** | **64.5** |
| $\Delta$ over SOTA | - | - | +2.84 | +18.20 | +8.30 | +3.87 | +1.98 | +11.0 |
## Single-Image Performance
| Model | Size | TextVQA | VQA | MMB | MMMU | OKVQA | SQA | MathVista | Avg |
|-----------------|:----:|:-------:|:----:|:----:|:----:|:-----:|:----:|:---------:|:----:|
| OpenFlamingo | 9B | 46.3 | 58.0 | 32.4 | 28.7 | 51.4 | 45.7 | 18.6 | 40.2 |
| Idefics1 | 9B | 39.3 | 68.8 | 45.3 | 32.5 | 50.4 | 51.6 | 21.1 | 44.1 |
| InstructBLIP | 7B | 33.6 | 75.2 | 38.3 | 30.6 | 45.2 | 70.6 | 24.4 | 45.4 |
| Yi-VL | 6B | 44.8 | 72.5 | 68.4 | 39.1 | 51.3 | 71.7 | 29.7 | 53.9 |
| Qwen-VL-Chat | 7B | 63.8 | 78.2 | 61.8 | 35.9 | 56.6 | 68.2 | 15.5 | 54.3 |
| LLaVA-1.5 | 7B | 58.2 | 76.6 | 64.8 | 35.3 | 53.4 | 70.4 | 25.6 | 54.9 |
| Emu2-Chat | 37B | <u>66.6</u> | **84.9** | 63.6 | 36.3 | **64.8** | 65.3 | 30.7 | 58.9 |
| CogVLM | 17B | **70.4** | <u>82.3</u> | 65.8 | 32.1 | <u>64.8</u> | 65.6 | 35.0 | 59.4 |
| Idefics2 | 8B | 70.4 | 79.1 | <u>75.7</u> | **43.0** | 53.5 | **86.5** | **51.4** | **65.7** |
| Mantis-CLIP | 8B | 56.4 | 73.0 | 66.0 | 38.1 | 53.0 | 73.8 | 31.7 | 56.0 |
| Mantis-SigLIP | 8B | 59.2 | 74.9 | 68.7 | 40.1 | 55.4 | 74.9 | 34.4 | 58.2 |
| Mantis-Idefics2 | 8B | 63.5 | 77.6 | 75.7 | <u>41.1</u> | 52.6 | <u>81.3</u> | <u>40.4</u> | <u>61.7</u> |
## How to use
### Installation
```bash
# This only installs minimum packages (torch, transformers, accelerate) for inference, no redundant packages are installed.
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
```
### Run example inference:
```python
from mantis.models.mllava import chat_mllava
from PIL import Image
import torch
image1 = "image1.jpg"
image2 = "image2.jpg"
images = [Image.open(image1), Image.open(image2)]
# load processor and model
from mantis.models.mllava import MLlavaProcessor, LlavaForConditionalGeneration
processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-8B-siglip-llama3")
attn_implementation = None # or "flash_attention_2"
model = LlavaForConditionalGeneration.from_pretrained("TIGER-Lab/Mantis-8B-siglip-llama3", device_map="cuda", torch_dtype=torch.bfloat16, attn_implementation=attn_implementation)
generation_kwargs = {
"max_new_tokens": 1024,
"num_beams": 1,
"do_sample": False
}
# chat
text = "Describe the difference of <image> and <image> as much as you can."
response, history = chat_mllava(text, images, model, processor, **generation_kwargs)
print("USER: ", text)
print("ASSISTANT: ", response)
text = "How many wallets are there in image 1 and image 2 respectively?"
response, history = chat_mllava(text, images, model, processor, history=history, **generation_kwargs)
print("USER: ", text)
print("ASSISTANT: ", response)
"""
USER: Describe the difference of <image> and <image> as much as you can.
ASSISTANT: The second image has more variety in terms of colors and designs. While the first image only shows two brown leather pouches, the second image features four different pouches in various colors and designs, including a purple one with a gold coin, a red one with a gold coin, a black one with a gold coin, and a brown one with a gold coin. This variety makes the second image more visually interesting and dynamic.
USER: How many wallets are there in image 1 and image 2 respectively?
ASSISTANT: There are two wallets in image 1, and four wallets in image 2.
"""
```
### Training
See [mantis/train](https://github.com/TIGER-AI-Lab/Mantis/tree/main/mantis/train) for details
### Evaluation
See [mantis/benchmark](https://github.com/TIGER-AI-Lab/Mantis/tree/main/mantis/benchmark) for details
**Please cite our paper or give a star to out Github repo if you find this model useful**
## Citation
```
@article{Jiang2024MANTISIM,
title={MANTIS: Interleaved Multi-Image Instruction Tuning},
author={Dongfu Jiang and Xuan He and Huaye Zeng and Cong Wei and Max W.F. Ku and Qian Liu and Wenhu Chen},
journal={Transactions on Machine Learning Research},
year={2024},
volume={2024},
url={https://openreview.net/forum?id=skLtdUVaJa}
}
``` |