---
license: apache-2.0
---
# Model Card for Belle-VL
## Welcome
If you find this model helpful, please *like* this model and star us at https://github.com/LianjiaTech/BELLE!
## 📝Belle-VL
### Background
**The community already offers many open-source multimodal large language models, but most of them focus on English, e.g. [LLava](https://github.com/haotian-liu/LLaVA) and [CogVLM](https://github.com/THUDM/CogVLM). Chinese multimodal LLMs such as [VisualGLM-6B](https://github.com/THUDM/VisualGLM-6B) and [Qwen-VL](https://github.com/QwenLM/Qwen-VL) are built on relatively small language-model backbones, which makes it hard to balance visual and language abilities in practice. Belle-VL therefore extends a stronger language-model backbone with visual capabilities, giving the community a more flexible choice.**
### Model Overview
Architecturally, we mainly follow [Qwen-VL](https://github.com/QwenLM/Qwen-VL). The original Qwen-VL is trained on top of the Qwen-7B model, whose base capability is relatively limited, so Belle-VL replaces the language model with [Qwen14B-chat](https://huggingface.co/Qwen/Qwen-14B-Chat), which balances Chinese language ability with visual ability and offers better extensibility.
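As a rough mental model of the Qwen-VL-style composition described above, here is an illustrative sketch (not the actual Belle-VL implementation; module names are hypothetical):

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Illustrative composition: vision encoder -> adapter -> language model."""

    def __init__(self, vision_encoder: nn.Module, adapter: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder   # ViT-style encoder producing patch features
        self.adapter = adapter                 # maps visual features into the LLM embedding space
        self.language_model = language_model   # here Qwen-14B-Chat instead of the original Qwen-7B

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.adapter(self.vision_encoder(pixel_values))
        # Visual tokens are spliced into the text embedding sequence before decoding
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```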
### Training Strategy
The original Qwen-VL uses a three-stage recipe (pre-training, multi-task training, and instruction fine-tuning), which requires large amounts of data and compute. Inspired by LLaVA-1.5, which found multimodal instruction fine-tuning to matter more than pre-training, we adopt a two-stage recipe instead, as shown below:
![Training_stages](./train.png)
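A minimal sketch of such a two-stage schedule, assuming a LLaVA-1.5-style setup in which stage 1 updates only the vision-language adapter and stage 2 also unfreezes the language model (these freezing choices are an assumption, not a confirmed Belle-VL detail):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: int) -> None:
    # model is assumed to expose .vision_encoder, .adapter and .language_model (illustrative names)
    set_trainable(model.vision_encoder, False)       # vision encoder stays frozen in both stages (assumption)
    set_trainable(model.adapter, True)               # adapter is trained in both stages
    set_trainable(model.language_model, stage == 2)  # LLM is only unfrozen for instruction fine-tuning
```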
### Training Data
* **Pre-training data**: mainly the [558k](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) English instruction samples from LLava together with their Chinese translations; in addition, we collected [Flickr30k-CNA](https://zero.so.com/) and 100k samples randomly drawn from [AI Challenger](https://tianchi.aliyun.com/dataset/145781?spm=a2c22.12282016.0.0.5c823721PG2nBW).
* **Multimodal instruction data**: for the instruction fine-tuning stage, the data mainly comes from open-source projects such as [LLava](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR), and [LVIS-INSTRUCT4V](https://github.com/X2FD/LVIS-INSTRUCT4V); we also translated part of this data into Chinese. We sincerely thank these projects for their contributions to open source!
### Model Usage
``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_dir = '/path/to_finetuned_model/'
img_path = 'your_image_path'

# Load the tokenizer and model; trust_remote_code is required for the custom Qwen-VL-style code
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained(model_dir, trust_remote_code=True)

question = '详细描述一下这张图'  # "Describe this image in detail"

# Option 1: build the query with the tokenizer helper
query = tokenizer.from_list_format([
    {'image': img_path},  # either a local path or a URL
    {'text': question},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# Option 2: build the query string manually with <img> tags
query = f'<img>{img_path}</img>\n{question}'
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
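`model.chat` also returns the running `history`, so a follow-up question can reuse it. A minimal sketch (the follow-up prompt is only an example):

```python
# Continue the conversation by passing back the history returned above
followup = '图中有哪些物体?'  # "What objects are in the image?"
response, history = model.chat(tokenizer, query=followup, history=history)
print(response)
```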
### [MME Benchmark](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation)
| Category | Score |
|------------------------|-------|
| **Perception** | **1595.34** |
| Existence | 190 |
| Count | 150 |
| Position | 130 |
| Color | 175 |
| Posters | 166.33|
| Celebrity | 136.76|
| Scene | 156.25|
| Landmark | 174 |
| Artwork | 139.5 |
| OCR | 177.5 |
| **Cognition** | **332.14** |
| Commonsense reasoning  | 127.14|
| Numerical calculation  | 47.5  |
| Text translation       | 102.5 |
| Code reasoning         | 55    |
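The **Perception** and **Cognition** entries are aggregate scores, i.e. the sums of their sub-category scores in the table above, which can be checked quickly:

```python
# Quick check that the aggregate scores equal the sum of their sub-categories
perception = [190, 150, 130, 175, 166.33, 136.76, 156.25, 174, 139.5, 177.5]
cognition = [127.14, 47.5, 102.5, 55]

print(round(sum(perception), 2))  # 1595.34
print(round(sum(cognition), 2))   # 332.14
```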