---
license: apache-2.0
language:
- zh
- en
---

# VisCPM

[GITHUB](https://github.com/OpenBMB/VisCPM)

`VisCPM` is a family of open-source large multimodal models that support multimodal conversation (`VisCPM-Chat`) and text-to-image generation (`VisCPM-Paint`) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. `VisCPM` is built on the 10B-parameter large language model [CPM-Bee](https://huggingface.co/openbmb/cpm-bee-10b), fusing a visual encoder (`Q-Former`) and a visual decoder (`Diffusion-UNet`) to support visual inputs and outputs. Thanks to the strong bilingual capability of `CPM-Bee`, `VisCPM` can be pre-trained on English multimodal data only and still generalize well, achieving promising Chinese multimodal capabilities.
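
The snippet below sketches how image-grounded conversation could be invoked once a checkpoint is downloaded. The `VisCPMChat` class, its constructor, and the `chat` call follow the usage examples in the GitHub repository linked above and should be treated as assumptions rather than a guaranteed API; the checkpoint path is a placeholder.

```python
# Hedged usage sketch: the import path, class name, and chat() signature are
# assumed from the OpenBMB/VisCPM repository examples; consult the repo for
# the authoritative interface and checkpoint download instructions.
from PIL import Image
from VisCPM import VisCPMChat  # assumed import

viscpm_chat = VisCPMChat('/path/to/viscpm_chat_checkpoint.pt')  # placeholder path
image = Image.open('example.jpg').convert('RGB')

# Questions may be asked in Chinese or English; after bilingual instruction
# fine-tuning, the answer follows the language of the question.
answer, _, _ = viscpm_chat.chat(image, 'What is unusual about this image?')
print(answer)
```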

## VisCPM-Chat
`VisCPM-Chat` supports image-grounded multimodal conversation in both Chinese and English. The model uses `Q-Former` as the visual encoder and CPM-Bee (10B) as the base LLM, bridging the two with a language-modeling training objective. Training consists of two stages: pretraining and instruction fine-tuning.
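
Conceptually, the `Q-Former` produces a fixed number of query embeddings from image features, these are projected into the language model's embedding space and prepended to the text embeddings, and the combined sequence is trained with the usual language-modeling loss. The sketch below illustrates that wiring with small, generic PyTorch modules; every module, dimension, and vocabulary size is an illustrative stand-in, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative toy sizes only; the real Q-Former / CPM-Bee dimensions differ.
VISUAL_DIM, LLM_DIM, NUM_QUERIES, VOCAB = 256, 512, 32, 32000

class VisualPrefixLM(nn.Module):
    """Toy stand-in: learned queries -> Q-Former-style cross-attention ->
    projection into the LLM embedding space -> visual prefix for the LLM."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, VISUAL_DIM))
        self.qformer = nn.TransformerDecoderLayer(d_model=VISUAL_DIM, nhead=8, batch_first=True)
        self.proj = nn.Linear(VISUAL_DIM, LLM_DIM)    # maps visual features to LLM space
        self.tok_emb = nn.Embedding(VOCAB, LLM_DIM)   # stand-in for LLM token embeddings
        self.llm = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(LLM_DIM, VOCAB)

    def forward(self, image_feats, input_ids, labels):
        b = input_ids.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        visual_prefix = self.proj(self.qformer(q, image_feats))   # (b, NUM_QUERIES, LLM_DIM)
        hidden = self.llm(torch.cat([visual_prefix, self.tok_emb(input_ids)], dim=1))
        logits = self.lm_head(hidden[:, NUM_QUERIES:])             # loss on text positions only
        return nn.functional.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))

# Dummy pass: 257 patch features from a frozen vision backbone, 16 text tokens
# (label shifting for next-token prediction is omitted for brevity).
model = VisualPrefixLM()
loss = model(torch.randn(2, 257, VISUAL_DIM),
             torch.randint(0, VOCAB, (2, 16)),
             torch.randint(0, VOCAB, (2, 16)))
print(loss.item())
```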

* Pretraining: `VisCPM-Chat` was pretrained on approximately 100M high-quality English image-text pairs drawn from CC3M, CC12M, COCO, Visual Genome, LAION, and other sources. In this stage the language model parameters remain fixed and only the `Q-Former` parameters are updated, enabling efficient alignment of large-scale vision-language representations.

* Instruction fine-tuning: We used the [LLaVA-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) dataset, an English multimodal instruction-following dataset, mixed with its translated Chinese counterpart, to fine-tune the model and align its multimodal capabilities with user intents. In this stage we updated all model parameters to make better use of the instruction-tuning data. Interestingly, we observed that even when fine-tuning with English instruction data only, the model can comprehend Chinese questions but responds only in English, indicating good generalization of its multilingual and multimodal capabilities. Incorporating a small amount of translated Chinese data during instruction fine-tuning aligns the model's response language with the user's question language. A minimal sketch of the staged parameter freezing across these two stages follows this list.

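The sketch below illustrates the staged freezing described above with generic stand-in modules: stage 1 trains only the `Q-Former` while the language model stays frozen, and stage 2 unfreezes everything. It is a minimal illustration of the training schedule, not the project's actual training code.

```python
import torch.nn as nn

# Toy stand-ins; the real modules are the CPM-Bee backbone and the Q-Former.
llm = nn.Linear(64, 64)
q_former = nn.Linear(64, 64)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1 (pretraining): freeze the language model, update only the Q-Former.
set_trainable(llm, False)
set_trainable(q_former, True)
stage1_params = [p for p in q_former.parameters() if p.requires_grad]

# Stage 2 (instruction fine-tuning): unfreeze everything and update all parameters.
set_trainable(llm, True)
stage2_params = [p for m in (llm, q_former) for p in m.parameters() if p.requires_grad]

print(len(stage1_params), len(stage2_params))  # 2 vs. 4 tensors in this toy example
```
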
We evaluated the model on the LLaVA English test set and a translated Chinese test set. The benchmark assesses performance in open-domain conversation, image detail description, and complex reasoning, with GPT-4 used for scoring. `VisCPM-Chat` achieved the best average Chinese multimodal performance, excelling in general-domain conversation and complex reasoning, while also demonstrating commendable English multimodal abilities.

## VisCPM-Paint
`VisCPM-Paint` supports bilingual text-to-image generation. The model uses CPM-Bee (10B) as the text encoder and `UNet` as the image decoder, and fuses the language and visual models by training with the diffusion objective. During training the language model parameters remain fixed. The visual decoder is initialized from [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) and is fused with the language model by gradually unfreezing key bridging parameters: first a linear layer that maps text representations into the visual model is trained, and then the cross-attention layers of the `UNet` are unfrozen as well. The model was trained on the [LAION 2B](https://huggingface.co/datasets/laion/laion2B-en) English image-text pair dataset.
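
The snippet below sketches this staged unfreezing using the `diffusers` UNet from Stable Diffusion 2.1 as the visual decoder. The linear mapping layer and the assumed LLM hidden size are illustrative stand-ins for how CPM-Bee text representations would be bridged into the UNet's cross-attention context; this is not the project's training code.

```python
import torch.nn as nn
from diffusers import UNet2DConditionModel

# Visual decoder initialized from Stable Diffusion 2.1 (as described above).
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# Linear bridge from LLM hidden states to the UNet's cross-attention context.
# The LLM hidden size (4096) is an assumption for illustration.
text_to_visual = nn.Linear(4096, unet.config.cross_attention_dim)

# Stage A: train only the bridging layer; the UNet stays frozen.
unet.requires_grad_(False)
for p in text_to_visual.parameters():
    p.requires_grad = True

# Stage B: additionally unfreeze the UNet's cross-attention ("attn2") blocks.
for name, p in unet.named_parameters():
    if "attn2" in name:
        p.requires_grad = True

unfrozen = [n for n, p in unet.named_parameters() if p.requires_grad]
print(f"{len(unfrozen)} UNet parameter tensors unfrozen (cross-attention only)")
```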

Similar to `VisCPM-Chat`, we found that, thanks to the bilingual capability of CPM-Bee, `VisCPM-Paint` can be trained on English image-text pairs only and still generalize to strong Chinese text-to-image generation, achieving the best results among Chinese open-source models. By further adding 20M cleaned native Chinese image-text pairs and 120M image-text pairs translated into Chinese, the model's Chinese text-to-image generation capability can be improved further.