bconsolvo committed
Commit 295ac35 (1 parent: 8ec8773)

Update README.md

Files changed (1):
  1. README.md +145 -39

README.md CHANGED
@@ -1,34 +1,79 @@
---
language:
- - en
- license: other
license_name: gemma-terms
license_link: https://ai.google.dev/gemma/terms
---

- # LLaVA-Gemma Model Card

- _This model card corresponds to the 2B version of the model with the CLIP-based vision encoder._

- Preprint: [arxiv.org/abs/2404.01331](https://arxiv.org/abs/2404.01331)

- ## Overview

- `llava-gemma-2b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744) with the 2-billion parameter `google/gemma-2b-it` model as language backbone.

- ## Uses

- The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.

- ## Bias, Risks, and Limitations
-
- This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
-
- ## How to Get Started with the Model
-
- Currently using `llava-gemma` requires a [modified preprocessor](./processing_llavagemma.py).
-
- _We are currently working on modifying the `LlavaProcessor` class to streamline usage (see [PR #30030](https://github.com/huggingface/transformers/pull/30030)), expect updates soon._

For current usage, see [`usage.py`](./usage.py) or the following code block:

@@ -66,37 +111,98 @@ inputs = processor(text=prompt, images=image, return_tensors="pt")
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)

```

- ## Training Details

- The `llava-gemma-2b` model was trained on 8 Gaudi 2 accelerators.

- ### Training Data

- The model was trained using the LLaVA-v1.5 data mixture.

- This is listed as follows:

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.

- ## Evaluation
-
- | LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | TextVQA | ScienceQA Image | MMVP |
- | ----------- | ------------ | -------------------- | ----- | ------------- | -------------- | ------ | ------------- | ------- | ----- | ------- | --------------- | ----- |
- | gemma-2b-it | CLIP | Yes | 0.531 | 236.071 | 1130.492 | 17.706 | 0.850 | 0.839 | 70.65 | 28.06 | 0.564 | 0.287 |
- | gemma-2b-it | CLIP | No | 0.481 | 247.857 | 934.611 | 13.119 | 0.784 | 0.762 | 61.74 | | 0.549 | 0.180 |
- | gemma-7b-it | CLIP | Yes | 0.472 | 253.571 | 894.910 | 18.165 | 0.848 | 0.829 | 68.7 | | 0.625 | 0.327 |
- | gemma-7b-it | CLIP | No | 0.472 | 278.214 | 857.274 | 19.083 | 0.782 | 0.734 | 65.09 | | 0.636 | 0.240 |
- | gemma-2b-it | DinoV2 | Yes | 0.587 | 307.143 | 1132.970 | 19.128 | 0.853 | 0.838 | 71.37 | 12.53 | 0.555 | 0.227 |
- | gemma-2b-it | DinoV2 | No | 0.501 | 308.929 | 959.351 | 14.541 | 0.793 | 0.772 | 61.65 | 11.1 | 0.568 | 0.180 |
-
- ## Responsible Use
-
- Intel is committed to respecting human rights and avoiding causing or directly contributing to adverse impacts on human rights.
- See [Intel’s Global Human Rights Policy](https://www.intel.com/content/www/us/en/policy/policy-human-rights.html).
- The software and the fine-tuned model licensed from Intel is intended for socially responsible applications and should not be used to cause or contribute to a violation of internationally recognized human rights.

---
language:
+ - en
+ license: gemma
license_name: gemma-terms
license_link: https://ai.google.dev/gemma/terms
+ base_model: google/gemma-2b-it
+ tags:
+ - LLM
+ - MMFM
+ - Intel
+ model-index:
+ - name: llava-gemma-2b
+   results:
+   - task:
+       type: Large Language Model
+       name: Large Language Model
+     metrics:
+     - type: GQA
+       name: GQA
+       value: 0.587
+     - type: MME Cog.
+       name: MME Cog.
+       value: 309
+     - type: MME Per.
+       name: MME Per.
+       value: 1133
+     - type: MM-Vet
+       name: MM-Vet
+       value: 19.1
+     - type: POPE Acc.
+       name: POPE Acc.
+       value: 0.853
+     - type: POPE F1
+       name: POPE F1
+       value: 0.839
+     - type: VQAv2
+       name: VQAv2
+       value: 71.4
+     - type: MMVP
+       name: MMVP
+       value: 0.327
+     - type: ScienceQA Image
+       name: ScienceQA Image
+       value: 0.636
+ library_name: transformers
+ pipeline_tag: image-text-to-text
---

+ ## Model Details: LLaVA-Gemma-2b

+ `llava-gemma-2b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744) with the 2-billion-parameter [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) model as its language backbone and a CLIP-based vision encoder.
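 
_To check how the language backbone and the vision encoder are wired together in the released checkpoint, you can inspect its configuration. The following is a small illustrative sketch (the `AutoConfig` call and the printed fields are standard `transformers` LLaVA config attributes, not something taken from this card):_

```python
# Sketch: inspect the LLaVA-style config of the checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Intel/llava-gemma-2b")
print(config.text_config.model_type)    # language backbone
print(config.vision_config.model_type)  # vision tower
```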

+ | Model Details | Description |
+ | ----------- | ----------- |
+ | Authors | Intel: [Musashi Hinck](https://huggingface.co/musashihinck), [Matthew Olson](https://huggingface.co/matthewlyleolson), [David Cobbley](https://huggingface.co/djcobble), [Shao-Yen Tseng](https://huggingface.co/shaoyent), [Vasudev Lal](https://huggingface.co/vasudevlal) |
+ | Date | March 2024 |
+ | Version | 1 |
+ | Type | Large multimodal model (LMM) |
+ | Paper or Other Resources | [LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model](https://arxiv.org/abs/2404.01331) |
+ | License | [Gemma](https://ai.google.dev/gemma/terms) |
+ | Questions or Comments | [Community Tab](https://huggingface.co/Intel/llava-gemma-2b/discussions) and [Intel DevHub Discord](https://discord.gg/rv2Gp55UJQ) |

+ This model card was created by [Benjamin Consolvo](https://huggingface.co/bconsolvo) and the authors listed above.

+ ## Intended Use

+ | Intended Use | Description |
+ | ----------- | ----------- |
+ | Primary intended uses | The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot. |
+ | Primary intended users | Anyone using or evaluating multimodal models. |
+ | Out-of-scope uses | This model is not intended for uses that require high levels of factuality, high-stakes situations, mental health or medical applications, generating misinformation or disinformation, impersonating others, facilitating or inciting harassment or violence, or any use that could lead to a violation of a human right under the UN Declaration of Human Rights. |

+ ### How to use

+ Currently, using `llava-gemma` requires a [modified preprocessor](./processing_llavagemma.py). _We are working on modifying the `LlavaProcessor` class to streamline usage (see [PR #30030](https://github.com/huggingface/transformers/pull/30030)). Expect updates soon._

For current usage, see [`usage.py`](./usage.py) or the following code block:

generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
+ ```
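 
_The first half of this code block is unchanged from the previous revision, so the diff does not display it. For reference, below is a minimal end-to-end sketch that is consistent with the calls shown above; it assumes the `Intel/llava-gemma-2b` checkpoint and that `processing_llavagemma.py` exposes a `LlavaGemmaProcessor` class, so treat it as an illustrative sketch rather than a verbatim copy of [`usage.py`](./usage.py):_

```python
# Minimal sketch: fetch the custom processor module, load the model, and run
# the same generate/decode calls shown above on an example image.
import shutil
import requests
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, CLIPImageProcessor, LlavaForConditionalGeneration

checkpoint = "Intel/llava-gemma-2b"

# Copy the modified preprocessor next to this script so it can be imported.
shutil.copy(hf_hub_download(repo_id=checkpoint, filename="processing_llavagemma.py"),
            "processing_llavagemma.py")
from processing_llavagemma import LlavaGemmaProcessor  # assumed class name

model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint),
)

# Build a Gemma chat-template prompt; "<image>" marks where the image features go.
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat is shown in this image?"}],
    tokenize=False,
    add_generation_prompt=True,
)
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")
generate_ids = model.generate(**inputs, max_length=30)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```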
+
+ For straightforward use as a chatbot (without images), you can modify the last portion of the code to the following:

+ ```python
+ # Prepare inputs
+ # Use gemma chat template
+ prompt = processor.tokenizer.apply_chat_template(
+     [{'role': 'user', 'content': "Summarize the following paragraph? In this paper, we introduced LLaVA-Gemma, a compact vision-language model leveraging the Gemma Large Language Model in two variants, Gemma-2B and Gemma-7B. Our work provides a unique opportunity for researchers to explore the trade-offs between computational efficiency and multimodal understanding in small-scale models. The availability of both variants allows for a comparative analysis that sheds light on how model size impacts performance in various tasks. Our evaluations demonstrate the versatility and effectiveness of LLaVA-Gemma across a range of datasets, highlighting its potential as a benchmark for future research in small-scale vision-language models. With these models, future practitioners can optimize the performance of small-scale multimodal models more directly."}],
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ # url = "https://www.ilankelman.org/stopsigns/australia.jpg"
+ # image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(text=prompt, images=None, return_tensors="pt")
+
+ # Generate
+ generate_ids = model.generate(**inputs, max_length=300)
+ output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ print(output)
```
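 
_The snippets above run on CPU by default. If a GPU is available, you would typically move the model and the processed inputs onto it before generating; a generic sketch (device handling is not part of the original usage code):_

```python
# Sketch: run generation on a CUDA device when one is available.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = inputs.to(device)  # the BatchFeature returned by the processor supports .to()
generate_ids = model.generate(**inputs, max_length=300)
```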

+ ## Factors
+
+ | Factors | Description |
+ | ----------- | ----------- |
+ | Groups | - |
+ | Instrumentation | - |
+ | Environment | Trained for 4 hours on 8 Intel Gaudi 2 AI accelerators. |
+ | Card Prompts | Model training and deployment on alternate hardware and software will change model performance. |

+ ## Metrics

+ | Metrics | Description |
+ | ----------- | ----------- |
+ | Model performance measures | We evaluate the LLaVA-Gemma models on a collection of benchmarks similar to those used by other LMM works: GQA; MME; MM-Vet; POPE (accuracy and F1); VQAv2; MMVP; and the image subset of ScienceQA. Our experiments provide insights into the efficacy of various design choices within the LLaVA framework. |
+ | Decision thresholds | - |
+ | Approaches to uncertainty and variability | - |

+ ## Training Data

+ The model was trained using the LLaVA-v1.5 data mixture, listed as follows:

- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.

+ ## Quantitative Analyses
+
+ Performance of the LLaVA-Gemma models across seven benchmarks. Highlighted cells indicate the strongest performance among the LLaVA-Gemma models. The bottom two rows show the self-reported performance of LLaVA-Phi and LLaVA-v1.5, respectively. The bolded **gemma-2b-it** row corresponds to the model described in this card.
+
+ | LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | ScienceQA Image | MMVP |
+ | ----------- | ------------ | -------------------- | ----- | ------------- | -------------- | ------ | ------------- | ------- | ----- | --------------- | ----- |
+ | gemma-2b-it | CLIP | Yes | 0.531 | 236 | 1130 | 17.7 | 0.850 | <mark>0.839</mark> | 70.65 | 0.564 | 0.287 |
+ | **gemma-2b-it** | CLIP | No | 0.481 | 248 | 935 | 13.1 | 0.784 | 0.762 | 61.74 | 0.549 | 0.180 |
+ | gemma-2b-it | DinoV2 | Yes | <mark>0.587</mark> | 307 | <mark>1133</mark> | <mark>19.1</mark> | <mark>0.853</mark> | 0.838 | <mark>71.37</mark> | 0.555 | 0.227 |
+ | gemma-2b-it | DinoV2 | No | 0.501 | <mark>309</mark> | 959 | 14.5 | 0.793 | 0.772 | 61.65 | 0.568 | 0.180 |
+ | | | | | | | | | | | | |
+ | gemma-7b-it | CLIP | Yes | 0.472 | 253 | 895 | 18.2 | 0.848 | 0.829 | 68.7 | 0.625 | <mark>0.327</mark> |
+ | gemma-7b-it | CLIP | No | 0.472 | 278 | 857 | 19.1 | 0.782 | 0.734 | 65.1 | <mark>0.636</mark> | 0.240 |
+ | gemma-7b-it | DinoV2 | Yes | 0.519 | 257 | 1021 | 14.3 | 0.794 | 0.762 | 65.2 | 0.628 | <mark>0.327</mark> |
+ | gemma-7b-it | DinoV2 | No | 0.459 | 226 | 771 | 12.2 | 0.693 | 0.567 | 57.4 | 0.598 | 0.267 |
+ | | | | | | | | | | | | |
+ | Phi-2b | CLIP | Yes | - | - | 1335 | 28.9 | - | 0.850 | 71.4 | 0.684 | - |
+ | Llama-2-7b | CLIP | Yes | 0.620 | 348 | 1511 | 30.6 | 0.850 | 0.859 | 78.5 | 0.704 | 46.1 |
+
+ ## Ethical Considerations
+
+ Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See [Intel’s Global Human Rights Principles](https://www.intel.com/content/dam/www/central-libraries/us/en/documents/policy-human-rights.pdf). Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.
+
+ | Ethical Considerations | Description |
+ | ----------- | ----------- |
+ | Data | The model was trained using the LLaVA-v1.5 data mixture as described above. |
+ | Human life | The model is not intended to inform decisions central to human life or flourishing. |
+ | Mitigations | No additional risk mitigation strategies were considered during model development. |
+ | Risks and harms | This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm. |
+ | Use cases | - |
+
+ ## Caveats and Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
+
+ ## Citation details
+ ```bibtex
+ @misc{hinck2024llavagemma,
+       title={LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model},
+       author={Musashi Hinck and Matthew L. Olson and David Cobbley and Shao-Yen Tseng and Vasudev Lal},
+       year={2024},
+       eprint={2404.01331},
+       url={https://arxiv.org/abs/2404.01331},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```