---
license: apache-2.0
language:
- en
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: visual-question-answering

tags:
- indox
- phoenix
- osllm.ai
- language
---
# Model Card for Llama-3.2V-11B-cot


Llama-3.2V-11B-cot is the first version of [LLaVA-o1](https://github.com/PKU-YuanGroup/LLaVA-o1), which is a visual language model capable of spontaneous, systematic reasoning.

## Model Details


- **License:** apache-2.0
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

## Benchmark Results

| MMStar | MMBench | MMVet | MathVista | AI2D | Hallusion | Average |
|--------|---------|-------|-----------|------|-----------|---------|
| 57.6   | 75.0    | 60.3  | 54.8      | 85.7 | 47.8      | 63.5    |

## Reproduction


To reproduce our results, you should use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and the following settings.

| Parameter         | Value   |
|-------------------|---------|
| do_sample         | True    |
| temperature       | 0.6     |
| top_p             | 0.9     |
| max_new_tokens    | 2048    |

You can change them in [this file](https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/vlm/llama_vision.py), lines 80-83, and adjust `max_new_tokens` throughout the file.

Note: We follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend `max_new_tokens` to 2048.

After you get the results, you should filter the model output and only **keep the outputs between \<CONCLUSION\> and \</CONCLUSION\>**.

In theory this should make no difference, but in practice we observe some performance differences because the GPT-4o judge can occasionally be inaccurate.

By keeping only the outputs between \<CONCLUSION\> and \</CONCLUSION\>, most answers can be extracted directly by the VLMEvalKit system, which is much less biased.
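
For illustration, the filtering step can be a simple regular-expression match over each raw model output. The helper below is a minimal sketch (the function name is ours, not part of VLMEvalKit); it falls back to the raw text when the tags are missing.

```python
import re

def extract_conclusion(raw_output: str) -> str:
    """Keep only the text between <CONCLUSION> and </CONCLUSION>; fall back to the raw output."""
    match = re.search(r"<CONCLUSION>(.*?)</CONCLUSION>", raw_output, flags=re.DOTALL)
    return match.group(1).strip() if match else raw_output.strip()
```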

## How to Get Started with the Model

You can use the inference code for Llama-3.2-11B-Vision-Instruct.
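
Below is a minimal sketch following the standard `transformers` recipe for Llama-3.2-11B-Vision-Instruct; the model id and image URL are placeholders, and the sampling parameters mirror the reproduction settings above.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Placeholder: replace with this repository's model id.
model_id = "path/to/Llama-3.2V-11B-cot"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; replace with your own image.
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many objects are in the image? Explain your reasoning."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sampling parameters follow the reproduction settings above.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=2048,
)
print(processor.decode(output[0], skip_special_tokens=True))
```

The generated text contains the model's intermediate reasoning; to keep only the final answer, apply the \<CONCLUSION\>-filtering step described in the Reproduction section.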

## Training Details

### Training Data


The model is trained on the LLaVA-o1-100k dataset (to be released).

### Training Procedure


The model is finetuned using [llama-recipes](https://github.com/Meta-Llama/llama-recipes) with the following settings.
Using the same settings should accurately reproduce our results.

| Parameter                     | Value                                             |
|-------------------------------|---------------------------------------------------|
| FSDP                          | enabled                                           |
| lr                            | 1e-5                                              |
| num_epochs                    | 3                                                 |
| batch_size_training           | 4                                                 |
| use_fast_kernels              | True                                              |
| run_validation                | False                                             |
| batching_strategy             | padding                                           |
| context_length                | 4096                                              |
| gradient_accumulation_steps   | 1                                                 |
| gradient_clipping             | False                                             |
| gradient_clipping_threshold   | 1.0                                               |
| weight_decay                  | 0.0                                               |
| gamma                         | 0.85                                              |
| seed                          | 42                                                |
| use_fp16                      | False                                             |
| mixed_precision               | True                                              |
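
For reference, here is a hedged sketch of how these settings map onto llama-recipes-style configuration overrides. The field names mirror the table above; the exact config class, script path, and launch command depend on your llama-recipes version and are not specified here.

```python
# Assumed mapping of the table above onto llama-recipes train_config fields.
# These values are typically passed as command-line overrides to the finetuning script.
train_config_overrides = {
    "enable_fsdp": True,  # FSDP: enabled
    "lr": 1e-5,
    "num_epochs": 3,
    "batch_size_training": 4,
    "use_fast_kernels": True,
    "run_validation": False,
    "batching_strategy": "padding",
    "context_length": 4096,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": False,
    "gradient_clipping_threshold": 1.0,
    "weight_decay": 0.0,
    "gamma": 0.85,
    "seed": 42,
    "use_fp16": False,
    "mixed_precision": True,
}
```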


## Bias, Risks, and Limitations


Like other VLMs, the model may generate biased or offensive content due to limitations in its training data.
Its performance on aspects such as instruction following also still falls short of leading industry models.




**About [osllm.ai](https://osllm.ai)**:

[osllm.ai](https://osllm.ai) is a community-driven platform that provides access to a wide range of open-source language models.

1. **[IndoxJudge](https://github.com/indoxJudge)**: A free, open-source tool for evaluating large language models (LLMs).  
It provides key metrics to assess performance, reliability, and risks like bias and toxicity, helping ensure model safety.

1. **[inDox](https://github.com/inDox)**: An open-source retrieval augmentation tool for extracting data from various  
document formats (text, PDFs, HTML, Markdown, LaTeX). It handles structured and unstructured data and supports both  
online and offline LLMs.

1. **[IndoxGen](https://github.com/IndoxGen)**: A framework for generating high-fidelity synthetic data using LLMs and  
human feedback, designed for enterprise use with high flexibility and precision.

1. **[Phoenix](https://github.com/Phoenix)**: A multi-platform, open-source chatbot that interacts with documents  
locally, without internet or GPU. It integrates inDox and IndoxJudge to improve accuracy and prevent hallucinations,  
ideal for sensitive fields like healthcare.

1. **[Phoenix_cli](https://github.com/Phoenix_cli)**: A multi-platform command-line tool that runs LLaMA models locally,  
supporting up to eight concurrent tasks through multithreading, eliminating the need for cloud-based services.



**Disclaimers**


[osllm.ai](https://osllm.ai) is not the creator, originator, or owner of any Model featured in the Community Model Program. Each Community Model is created and provided by third parties. osllm.ai does not endorse, support, represent, or guarantee the completeness, truthfulness, accuracy, or reliability of any Community Model. You understand that Community Models can produce content that might be offensive, harmful, inaccurate, or otherwise inappropriate, or deceptive. Each Community Model is the sole responsibility of the person or entity who originated such Model. osllm.ai may not monitor or control the Community Models and cannot, and does not, take responsibility for any such Model. osllm.ai disclaims all warranties or guarantees about the accuracy, reliability, or benefits of the Community Models. osllm.ai further disclaims any warranty that the Community Model will meet your requirements, be secure, uninterrupted, or available at any time or location, or error-free, virus-free, or that any errors will be corrected, or otherwise. You will be solely responsible for any damage resulting from your use of or access to the Community Models, your downloading of any Community Model, or use of any other Community Model provided by or through [osllm.ai](https://osllm.ai).