---
license: apache-2.0
language:
- zh
- en
---

# VisCPM

VisCPM is a family of open-source large multimodal models that support multimodal conversation (the VisCPM-Chat model) and text-to-image generation (the VisCPM-Paint model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is built on CPM-Bee, a large language model with 10B parameters, fusing a visual encoder (Q-Former) and a visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the strong bilingual capability of CPM-Bee, VisCPM can be pre-trained on English multimodal data only and still generalizes well, achieving promising Chinese multimodal capabilities.
## VisCPM-Chat

VisCPM-Chat supports bilingual multimodal conversations about images in both Chinese and English. The model uses Q-Former as the visual encoder and CPM-Bee (10B) as the base LLM, and combines the visual and language modules through a language modeling training objective. Training consists of two stages: pretraining and instruction fine-tuning.
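To make this architecture concrete, below is a minimal, illustrative PyTorch sketch of a Q-Former-style bridge: learnable query tokens attend to frozen image features and are projected into the LLM's embedding space as visual tokens. The module names, layer counts, and dimensions are assumptions for illustration, not VisCPM's actual implementation.

```python
import torch
import torch.nn as nn

class VisualPrefixBridge(nn.Module):
    """Illustrative Q-Former-style bridge (dimensions and layer counts are
    assumptions, not VisCPM's actual configuration)."""

    def __init__(self, vision_dim=1024, qformer_dim=768, llm_dim=4096, num_queries=32):
        super().__init__()
        # Learnable query tokens that cross-attend to frozen image features.
        self.queries = nn.Parameter(torch.randn(num_queries, qformer_dim))
        layer = nn.TransformerDecoderLayer(d_model=qformer_dim, nhead=8, batch_first=True)
        self.q_former = nn.TransformerDecoder(layer, num_layers=6)
        self.vision_proj = nn.Linear(vision_dim, qformer_dim)
        # Projects Q-Former outputs into the LLM embedding space.
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, image_feats):                  # (B, patches, vision_dim)
        memory = self.vision_proj(image_feats)       # (B, patches, qformer_dim)
        queries = self.queries.expand(image_feats.size(0), -1, -1)
        visual_tokens = self.q_former(tgt=queries, memory=memory)
        return self.llm_proj(visual_tokens)          # (B, num_queries, llm_dim)

# The projected visual tokens are prepended to the text embeddings, and the
# combined sequence is trained with the usual next-token language modeling loss.
```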
- Pretraining: VisCPM-Chat was pretrained on approximately 100 million high-quality English image-text pairs, drawn from CC3M, CC12M, COCO, Visual Genome, LAION, and other sources. In this stage, the language model parameters remain frozen and only the Q-Former parameters are updated (see the parameter-freezing sketch after this list), enabling efficient alignment of large-scale vision-language representations.
- Instruction fine-tuning: We used the LLaVA-150K dataset, an English multimodal instruction-following dataset, mixed with its translated Chinese counterpart, to fine-tune the model and align its multimodal capabilities with user intents. In this stage, all model parameters are updated to make better use of the instruction data. Interestingly, we observed that even when fine-tuning with English instruction data only, the model can understand Chinese questions but responds only in English, indicating that it generalizes well across languages and modalities. Adding a small amount of translated Chinese data during instruction fine-tuning aligns the model's response language with the language of the user's question.
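As a rough illustration of this two-stage recipe, the snippet below shows the kind of parameter-freezing schedule described above. The `llm` and `bridge` modules and the learning rates are placeholders, not VisCPM's actual training code.

```python
import torch
import torch.nn as nn

llm = nn.Linear(4096, 4096)     # placeholder for the CPM-Bee language model
bridge = nn.Linear(768, 4096)   # placeholder for the Q-Former bridge

# Stage 1 (pretraining): the LLM stays frozen; only the bridge is updated so
# that visual representations align with the frozen text space.
llm.requires_grad_(False)
bridge.requires_grad_(True)
stage1_optim = torch.optim.AdamW(bridge.parameters(), lr=1e-4)

# Stage 2 (instruction fine-tuning): all parameters are unfrozen and updated
# on the mixed English + translated-Chinese instruction data.
llm.requires_grad_(True)
stage2_optim = torch.optim.AdamW(
    list(llm.parameters()) + list(bridge.parameters()), lr=2e-5)
```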
We evaluated the model on the LLaVA English test set and its translated Chinese counterpart. The benchmark assesses performance in open-domain conversation, detailed image description, and complex reasoning, with GPT-4 used for scoring. VisCPM-Chat achieves the best average performance in Chinese multimodal capabilities, excelling in general-domain conversation and complex reasoning, and also demonstrates solid English multimodal abilities.
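For orientation, a usage sketch in the style of the project's Python interface is shown below. The class name `VisCPMChat`, its constructor arguments, and the return values are assumptions to be verified against the official VisCPM repository.

```python
from PIL import Image
from VisCPM import VisCPMChat  # assumed entry point; verify against the repo

# Path to a downloaded VisCPM-Chat checkpoint (placeholder).
viscpm_chat = VisCPMChat('/path/to/viscpm_chat_checkpoint.pt')

image = Image.open('example.jpg').convert('RGB')
question = '这幅图像描绘了什么场景？'  # "What scene does this image depict?"

# The chat call is assumed to return the answer plus conversation state;
# check the repository README for the exact signature.
answer, context, _ = viscpm_chat.chat(image, question)
print(answer)
```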
## VisCPM-Paint

VisCPM-Paint supports bilingual text-to-image generation. The model uses CPM-Bee (10B) as the text encoder and UNet as the image decoder, and fuses the language and visual modules by training with a diffusion-model objective. During training, the language model parameters remain frozen. The visual decoder is initialized from Stable Diffusion 2.1 and fused with the language model by gradually unfreezing the key bridging parameters: first a linear layer mapping text representations into the visual model is trained, and then the cross-attention layers of the UNet are further unfrozen. The model was trained on the LAION 2B English image-text pair dataset.
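This staged unfreezing can be sketched with Hugging Face diffusers as below. The mapping-layer name, the assumed 4096-d CPM-Bee hidden size, and the phase boundaries are illustrative assumptions rather than VisCPM's actual code.

```python
import torch.nn as nn
from diffusers import UNet2DConditionModel

# Visual decoder initialized from Stable Diffusion 2.1.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet")

# Linear layer mapping CPM-Bee text representations (assumed 4096-d) into the
# UNet's cross-attention conditioning space (1024-d for SD 2.1).
text_to_cond = nn.Linear(4096, unet.config.cross_attention_dim)

# Phase 1: train only the mapping layer; the UNet stays frozen.
unet.requires_grad_(False)
trainable_params = list(text_to_cond.parameters())

# Phase 2: additionally unfreeze the UNet's cross-attention layers
# (the `attn2` blocks, which attend to the text conditioning).
for name, module in unet.named_modules():
    if name.endswith("attn2"):
        module.requires_grad_(True)
        trainable_params += list(module.parameters())
```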
Similar to VisCPM-Chat, we found that, thanks to the bilingual capability of CPM-Bee, VisCPM-Paint can be trained on English image-text pairs only and still generalizes to strong Chinese text-to-image generation, achieving the best results among Chinese open-source models. By further adding 20M cleaned native Chinese image-text pairs and 120M image-text pairs translated into Chinese, the model's Chinese text-to-image generation capability can be further improved.
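As with the chat model, a usage sketch is given below for orientation. The `VisCPMPaint` class name, its arguments, and the return value of `generate` are assumptions to be verified against the official VisCPM repository.

```python
from VisCPM import VisCPMPaint  # assumed entry point; verify against the repo

# Path to a downloaded VisCPM-Paint checkpoint (placeholder).
painter = VisCPMPaint('/path/to/viscpm_paint_checkpoint.pt')

# Prompts can be given in Chinese or English thanks to the bilingual CPM-Bee.
prompt = '夕阳下的海边灯塔，油画风格'  # "A seaside lighthouse at sunset, oil painting style"

# The return format (a single image vs. a list of images) should be checked
# against the repository README before use.
result = painter.generate(prompt)
```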