VisCPM
A Chinese-English bilingual multimodal large model series based on the CPM (Chinese Pretrained Models) base models
Github • VisCPM-Chat
VisCPM is a family of open-source large multimodal models that support multimodal conversation (the VisCPM-Chat model) and text-to-image generation (the VisCPM-Paint model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is built on the 10B-parameter large language model CPM-Bee, which is combined with a visual encoder (Q-Former) and a visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the strong bilingual capability of CPM-Bee, VisCPM can be pre-trained on English multimodal data only and still generalize well to Chinese multimodal tasks.
- Open-source Usage: VisCPM is free to use for personal and research purposes. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community of large multimodal models and related research.
- Image and text generation coverage: VisCPM models provide relatively comprehensive support for image and text multimodal capabilities, covering both multimodal conversation (image-to-text generation) and text-to-image generation.
- Excellent bilingual performance: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.
VisCPM-Paint
VisCPM-Paint supports bilingual text-to-image generation. The model uses CPM-Bee as the text encoder and a UNet as the image decoder, and fuses the vision and language models through the diffusion model training objective. During training, the parameters of the language model remain fixed. The visual decoder is initialized with the parameters of Stable Diffusion 2.1 and is fused with the language model by gradually unfreezing key bridging parameters. The model is trained on the LAION 2B English text-image pair dataset.
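For reference, the diffusion objective mentioned above is commonly written in the noise-prediction form below. This is the standard latent-diffusion loss, not a formula taken from the VisCPM release, and the symbols are assumptions introduced here: $z_t$ is the noised image latent at timestep $t$, $\epsilon$ is the Gaussian noise that was added, $\epsilon_\theta$ is the UNet, and $\tau_\phi(y)$ is the text encoder's (here, CPM-Bee's) representation of the prompt $y$. The source does not state which prediction parameterization is actually used.

```latex
\mathcal{L} = \mathbb{E}_{z_0,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta\big(z_t,\, t,\, \tau_\phi(y)\big) \right\rVert_2^2 \right]
```

Because the language model parameters remain fixed during training, $\phi$ would receive no gradient updates from this loss under this reading.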
Similar to VisCPM-Chat, we found that, thanks to the bilingual capability of CPM-Bee, VisCPM-Paint can achieve good Chinese text-to-image generation by training only on English text-image pairs, surpassing the performance of Chinese open-source models. Incorporating an additional 20M cleaned native Chinese text-image pairs and 120M text-image pairs translated into Chinese further improves the model's Chinese text-to-image generation ability. We sample 30,000 images from the standard image generation test set MSCOCO and calculate the commonly used evaluation metric FID (Fréchet Inception Distance) to assess the quality of generated images. As with VisCPM-Chat, we provide two versions of the model, namely VisCPM-Paint-balance and VisCPM-Paint-zhplus. The former has balanced ability in English and Chinese, while the latter emphasizes Chinese proficiency. VisCPM-Paint-balance is trained using only English text-image pairs, while VisCPM-Paint-zhplus builds on VisCPM-Paint-balance by incorporating the additional 20M native Chinese text-image pairs and 120M translated Chinese text-image pairs.
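To make the FID metric mentioned above concrete, here is a minimal NumPy sketch of the Fréchet distance between two Gaussians fitted to feature vectors. It is not the evaluation script used for VisCPM-Paint. The function names are hypothetical, and the sketch assumes the features (normally Inception-v3 activations of real and generated images) have already been extracted into arrays of shape `(num_images, feature_dim)`.

```python
import numpy as np


def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # Tr((sigma_r sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g, which are real and non-negative for
    # positive semi-definite covariance matrices.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    # Clip tiny negative values caused by floating-point error.
    sqrt_trace = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * sqrt_trace)


def fid_from_features(real_feats, gen_feats):
    """Fit a Gaussian to each feature set and return their Frechet distance."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Lower values indicate that the generated-image feature distribution is closer to the real-image one. Two identical feature sets give a distance of approximately zero.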
How to Use
#!/usr/bin/env python
# encoding: utf-8
from diffusers import DiffusionPipeline
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and the CPM-Bee text encoder from the model repository.
tokenizer = AutoTokenizer.from_pretrained('openbmb/VisCPM-Paint', trust_remote_code=True)
text_encoder = AutoModel.from_pretrained('openbmb/VisCPM-Paint', trust_remote_code=True)

# Build the diffusion pipeline with the repository's custom pipeline code.
print('load pipeline')
pipeline = DiffusionPipeline.from_pretrained('openbmb/VisCPM-Paint', custom_pipeline="openbmb/VisCPM-Paint", text_encoder=text_encoder, tokenizer=tokenizer)
pipeline = pipeline.to('cuda')  # move the pipeline to the GPU

# Generate one image from a text prompt and save it to disk.
prompt = "a photo of an astronaut riding a horse on mars"
image = pipeline(prompt).images[0]
image.save("astronaut_rides_horse.png")
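The snippet above generates a single image. The helper below is a small sketch for running several prompts in a loop. The function name `generate_and_save`, the `outputs` directory, and the file naming scheme are hypothetical. It assumes only what the snippet above shows: that `pipeline(prompt).images[0]` returns an object with a `save` method.

```python
from pathlib import Path


def generate_and_save(pipeline, prompts, out_dir="outputs"):
    """Run the pipeline on each prompt and save the first returned image.

    Assumes pipeline(prompt).images[0] has a .save(path) method, as in the
    usage snippet above. Returns the list of written file paths.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, prompt in enumerate(prompts):
        image = pipeline(prompt).images[0]
        # Hypothetical naming scheme: sample_000.png, sample_001.png, ...
        target = out_path / f"sample_{index:03d}.png"
        image.save(str(target))
        saved.append(target)
    return saved
```

Since VisCPM-Paint is bilingual, the `prompts` list can mix English and Chinese prompts, for example `generate_and_save(pipeline, ["a photo of an astronaut riding a horse on mars", "一只猫"])`.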
License
VisCPM is governed by the GML License, which permits individual and research use. If you intend to use the model for commercial purposes, please reach out to [email protected] to negotiate commercial licensing.
The CPM-Bee base, governed by the General Model License (GML), permits commercial usage. If you intend to utilize the model for commercial purposes, please reach out to [email protected] to obtain the certificate of authorization.