---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
---


# Model description
We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to better align with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.

`XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series builds upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that provide a more robust and superior foundation. \
These models have been trained at scale on high-quality image caption datasets and interleaved image-text data. XGen-MM highlights the following features:

* The **pretrained** foundation model, `xgen-mm-phi3-mini-base-r-v1`, achieves state-of-the-art performance under 5b parameters and demonstrates strong in-context learning capabilities.
* The **instruct** fine-tuned model, `xgen-mm-phi3-mini-instruct-r-v1`, achieves state-of-the-art performance among open-source and closed-source VLMs under 5b parameters. 
* `xgen-mm-phi3-mini-instruct-r-v1` supports flexible high-resolution image encoding with efficient visual token sampling.  

More technical details will be shared in an upcoming technical report.


# Datasets

| Dataset Type| Dataset(s) Used                          |
|--------|------------------------------------------|
| Pretrain | Caption data: DataComp, CC12M, CC3M, SBU, Visual Genome; interleaved data: OBELICS |
| Instruction Tuning    | LLaVA-Instruct-150K, ShareGPT4V captions, a mixture of academic VQA data including OCR/Document/Chart-focused tasks, publicly available text-only instruction data |

# Results

### Pretrain (base model without instruction tuning)
| Model       | Shot | COCO (val) | NoCaps (val) | TextCaps (val) | OKVQA  (val) | TextVQA (val) | VizWiz (testdev) | VQAv2 (testdev) |
|-------------|------|------------|--------------|----------------|--------------|---------------|------------------|-----------------|
| Flamingo-3B |    4 |       85.0 | -            | -              |         43.3 |          32.7 |               34 |            53.2 |
|             |    8 |       90.6 | -            | -              |         44.6 |          32.4 |             38.4 |            55.4 |
| MM1-3B      |    0 |       73.5 |         55.6 |           63.3 |         26.1 |          29.4 |             15.6 |            46.2 |
|             |    4 |      112.3 |         99.7 |           84.1 |         48.6 |          45.3 |             38.0 |            57.9 |
|             |    8 |      114.6 |        104.7 |           88.8 |         48.4 |          44.6 |             46.4 |            63.6 |
| **xgen-mm-phi3-mini-base-r-v1 (Ours)**|    0 |       **81.7** |         **80.2** |           60.7 |         **26.5** |          **36.0** |             **21.2** |            **48.1** |
|             |    4 |      110.5 |        **101.7** |           **84.6** |         **49.2** |          **46.1** |             **38.4** |            **63.9** |
|             |    8 |      112.1 |        104.4 |           87.7 |         **49.1** |          **46.4** |             44.3 |            **63.8** |

### Instruct (after instruction tuning)
| Model                      | SEED-IMG | MMBench (dev) | MME-total | MME-P | MME-C | MMStar | MMMU (val) | MMVet | MathVista (mini) | ScienceQA (test) | POPE | AI2D |
|----------------------------|----------|---------------|-----------|-------|-------|--------|------------|-------|------------------|------------------|------|------|
| MM1-3B-Chat                | 68.8     | 67.8          | 1761      | **1482** | 279 | -     | 33.9       | 43.7  | -                | -                | **87.4** | -  |
| openbmb/MiniCPM-V-2        | 67.1     | 69.6          | 1808      | -     | -     | -      | 38.2       | -     | 38.7             | -                | -    | -    |
| VILA1.5-3B                 | 67.9     | 63.4          | -         | 1442  | -     | -      | 33.3       | 35.4  | -                | 69.0             | 85.9 | -    |
| xtuner/llava-phi-3-mini-hf | 70.0     | 69.2          | 1790      | 1477  | 313   | 43.7   | **41.4**   | -     | -                | 73.7             | 87.3 | 69.3 |
| **xgen-mm-phi3-mini-instruct-r-v1 (Ours)** | **72.1** | **74.1** | **1827** | 1467 | **360** | **44.6** | 39.8 | **45.1** | **39.3** | **74.2** | 87.2 | **75.8** |


# How to use

> We require the development version (`4.41.0.dev0`) of the `transformers` library. To get it, as of 05/07/2024, one can run `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`.

```python
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor
import json
import PIL
import IPython.display as display
import torch
model = AutoModelForVision2Seq.from_pretrained("./", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./", trust_remote_code=True, use_fast=True, legacy=False)
image_processor = AutoImageProcessor.from_pretrained("./", trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)

model = model.to('cuda')
tokenizer.padding_side = "left"

def apply_prompt_template(prompt, num_images=1, num_tokens_per_vis = 128, in_context=False, output=None):
    """
    num_tokens_per_vis: model.vlm.num_tokens_per_vis
    """
    placeholder_image_tokens = "<image placeholder>" * (num_tokens_per_vis - 1)
    if in_context:
        formatted_prompt = f"<image>{placeholder_image_tokens}" + f"{prompt}" + f"{output}" + "<|endofchunk|>"
    else:
        formatted_prompt = f"<image>{placeholder_image_tokens}"*num_images + f"{prompt}"
    return formatted_prompt

############ Zero shot inference ##########
with open('./test_samples/zero_shot.json') as f:
    sample = json.load(f)
instruction = sample['instruction']
img = PIL.Image.open(sample['image_path'])
print("==> Instruction: ", instruction)
print("==> Image: ")
display.display(img.resize((int(img.width*0.3), int(img.height*0.3))))
inputs = image_processor([img], return_tensors="pt")
prompt = apply_prompt_template(instruction)
language_inputs = tokenizer([prompt], return_tensors="pt")
inputs.update(language_inputs)
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    generated_text = model.generate(**inputs, 
                                    pad_token_id=tokenizer.pad_token_id,
                                    do_sample=False, max_new_tokens=256, top_p=None, num_beams=1,
                                    length_penalty=1.0, repetition_penalty=2.0)
prediction = tokenizer.decode(generated_text[0], skip_special_tokens=True)
print("==> prediciton: ", prediction)
print("-"*120)
# ==> prediciton:  A man sits on a bench in front of the Red Corner Cafe.

############ Few shots inference ##########
# prepare in-context examples
with open('./test_samples/few_shots.json') as f:
    incontext_data = json.load(f)
print(f'In-context learning with {len(incontext_data)} examples.')
context_images, context_text = [], ""
for example in incontext_data:
    print("-"*40 + f" {example} " + "-"*40)
    img = PIL.Image.open(incontext_data[example]['image_path'])
    instruction = incontext_data[example]['instruction']
    example_text = apply_prompt_template(prompt=instruction, in_context=True, output=incontext_data[example]['output'])
    context_images.append(img)
    context_text += (example_text)
    print("==> Instruction: ", instruction)
    print("==> Image: ")
    display.display(img.resize((int(img.width*0.3), int(img.height*0.3))))
    print("==> Output: ", incontext_data[example]['output'])
# prepare test example
with open('./test_samples/zero_shot.json') as f:
    sample = json.load(f)
instruction = "A short description of this image in one sentence:"
print("-"*40 + " Prediction " + "-"*40)
img = PIL.Image.open(sample['image_path'])
print("==> Instruction: ", instruction)
print("==> Image: ")
display.display(img.resize((int(img.width*0.3), int(img.height*0.3))))
prompt = apply_prompt_template(instruction)
batch_images = context_images + [img]
batch_text = context_text + prompt
# prepare inputs
inputs = image_processor(batch_images, return_tensors="pt")
language_inputs = tokenizer([batch_text], return_tensors="pt")
inputs.update(language_inputs)
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    generated_text = model.generate(**inputs, 
                                    pad_token_id=tokenizer.pad_token_id,
                                    do_sample=False, max_new_tokens=256, top_p=None, num_beams=1,
                                    length_penalty=1.0)
prediction = tokenizer.decode(generated_text[0], skip_special_tokens=True)
print("==> prediciton: ", prediction)
print("-"*120)
```

More comprehensive examples can be found in the [notebook](demo.ipynb).
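
For a quick test on your own data, the zero-shot path above condenses to a few lines. The sketch below reuses the `model`, `tokenizer`, `image_processor`, and `apply_prompt_template` defined in the snippet above; the image path `my_image.jpg` and the question are placeholders to substitute with your own.

```python
# Minimal single-image query; assumes the objects from the snippet above are already loaded.
query_image = PIL.Image.open("my_image.jpg")       # placeholder path -- replace with your image
question = "Describe this image in one sentence."  # placeholder question

inputs = image_processor([query_image], return_tensors="pt")
inputs.update(tokenizer([apply_prompt_template(question)], return_tensors="pt"))
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    generated = model.generate(**inputs,
                               pad_token_id=tokenizer.pad_token_id,
                               do_sample=False, max_new_tokens=128, num_beams=1)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```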

# Reproducibility

Our SFT evaluation is based on VLMEvalKit, in which we fixed some inconsistencies with the official benchmarks (e.g., the LLM judge API). During development, we noticed that the raw resolution of the input image can noticeably affect the model output in some cases.


# Bias, Risks, Limitations, and Ethical Considerations
The main data sources are from the internet, including webpages, image stock sites, and curated datasets released by the research community. We have excluded certain data, such as LAION, due to known CSAM concerns.
The model may be subject to bias from the original data sources, as well as bias from LLMs and commercial APIs.
We strongly recommend that users assess safety and fairness before applying the model to downstream applications.


# License

Our code and weights are released under the Creative Commons Attribution Non-Commercial 4.0 [LICENSE](LICENSE.txt). Please fill out the form [here](https://forms.gle/ffPc9oZC2ZGeJ1N68) to inquire about commercial use of the model weights.

# Code acknowledgement

[LAVIS](https://github.com/salesforce/LAVIS) \
[openflamingo](https://github.com/mlfoundations/open_flamingo) \
[VLMEvalKit](https://github.com/open-compass/VLMEvalKit/tree/main)


# Citation
```
@misc{xgen_mm_phi3_mini,
    title={xgen-mm-phi3-mini-base Model Card},
    url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1},
    author={Salesforce AI Research},
    month={May},
    year={2024}
}
```

# Troubleshoot

1. If you are missing any packages, please consider installing the following:

```bash
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install open_clip_torch==2.24.0
pip install einops
pip install einops-exts
```