---
license: mit
datasets:
- laion/laion2B-en
- laion/laion-coco
- laion/laion2B-multi
- kakaobrain/coyo-700m
- conceptual_captions
- wanng/wukong100m
pipeline_tag: visual-question-answering
---
# Model Card for InternVL-Chat-V1-1
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4IG0h_KJ2cvpp9Kdm0Jf7.webp" alt="Image Description" width="300" height="300">
</p>
[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
[\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)
We released InternVL-Chat-V1-1, featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. In this version, we explored increasing the resolution to 448x448, enhancing OCR capabilities, and improving support for Chinese conversations.
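To make this layout concrete, here is a minimal, illustrative sketch of how a LLaVA-style pipeline wires the three pieces together. The feature widths match InternViT-6B (3200) and LLaMA2-13B (5120), but the exact projector layout is an assumption for illustration, not the model's actual code:

```python
import torch
import torch.nn as nn

# Conceptual sketch of the LLaVA-style layout (not the actual InternVL implementation):
# the ViT encodes the image into patch tokens, an MLP projector maps them into the
# LLM's embedding space, and the LLM consumes them alongside the text embeddings.
vit_dim, llm_dim = 3200, 5120                  # InternViT-6B width, LLaMA2-13B width

projector = nn.Sequential(                     # assumed two-layer MLP projector
    nn.Linear(vit_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

visual_tokens = torch.randn(1, 256, vit_dim)   # 256 visual tokens per 448 x 448 image
text_embeds = torch.randn(1, 32, llm_dim)      # embedded prompt tokens
inputs_embeds = torch.cat([projector(visual_tokens), text_embeds], dim=1)
print(inputs_embeds.shape)                     # torch.Size([1, 288, 5120])
```

The concatenated sequence would then be fed to the language model, e.g. through `inputs_embeds` in a Hugging Face causal LM.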
## Model Details
- **Model Type:** multimodal large language model (MLLM)
- **Model Stats:**
  - Architecture: [InternViT-6B-448px](https://huggingface.co/OpenGVLab/InternViT-6B-448px) + MLP + LLaMA2-13B (our internal SFT version)
  - Image size: 448 x 448 (256 tokens)
  - Params: 19B
- **Training Strategy:**
  - Pretraining Stage
    - Learnable Component: InternViT-6B + LLaMA2-13B
    - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
    - Note: In this stage, we load the pretrained weights of the original [InternViT-6B-224px](https://huggingface.co/OpenGVLab/InternViT-6B-224px) and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, to reduce the number of visual tokens, we use a pixel shuffle operation to merge the 1024 tokens into 256 tokens (see the sketch after this list).
  - Supervised Finetuning Stage
    - Learnable Component: MLP + LLaMA2-13B
    - Data: A comprehensive collection of open-source datasets, along with their Chinese translations, totaling approximately 6M samples.
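For illustration, the sketch below shows the two operations mentioned in the pretraining note: bicubic interpolation of the patch position embeddings from the 224px grid to the 448px grid, and the pixel-shuffle merge that turns 1024 visual tokens into 256. It assumes a patch size of 14 (so 448 / 14 = 32, i.e. a 32 x 32 grid), a leading [CLS] position embedding, and 3200-dim ViT features; it is a sketch of the idea, not the model's actual implementation:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    # pos_embed: (1, 1 + old_grid**2, C), with a leading [CLS] embedding (assumed layout)
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    c = patch_pos.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode='bicubic', align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)
    return torch.cat([cls_tok, patch_pos], dim=1)

def pixel_shuffle_tokens(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    # x: (B, N, C) patch tokens on a square grid; each 2x2 neighborhood is merged
    # into a single token with 4x the channels, so N -> N / 4.
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.view(b, h, w, c)
    x = x.view(b, h, int(w * scale), int(c / scale))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(b, int(h * scale), int(w * scale), int(c / (scale * scale)))
    return x.view(b, -1, int(c / (scale * scale)))

# 224px checkpoint: 16 x 16 patch grid; 448 x 448 input (patch size 14): 32 x 32 grid
pos = torch.randn(1, 1 + 16 * 16, 3200)
print(interpolate_pos_embed(pos, 32).shape)   # torch.Size([1, 1025, 3200])

tokens = torch.randn(1, 32 * 32, 3200)        # 1024 patch tokens from a 448 x 448 image
print(pixel_shuffle_tokens(tokens).shape)     # torch.Size([1, 256, 12800])
```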
## Released Models
### Vision Foundation Model
| Model | Date | Download | Note |
| ----------------------- | ---------- | ---------------------------------------------------------------------- | -------------------------------- |
| InternViT-6B-448px-V1-5 | 2024.04.20 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | support dynamic resolution, super strong OCR (🔥new) |
| InternViT-6B-448px-V1-2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution |
| InternViT-6B-448px-V1-0 | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) | 448 resolution |
| InternViT-6B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px) | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px) | vision-language foundation model |
### Multimodal Large Language Model (MLLM)
| Model | Date | Download | Note |
| ----------------------- | ---------- | --------------------------------------------------------------------------- | ---------------------------------- |
| InternVL-Chat-V1-5      | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)      | support 4K images; super strong OCR; approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
| InternVL-Chat-V1-2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) | more SFT data and stronger performance |
| InternVL-Chat-V1-2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) | scaling up LLM to 34B |
| InternVL-Chat-V1-1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1) | support Chinese and stronger OCR |
## Model Usage
We provide example code to run InternVL-Chat-V1-1 using `transformers`.
You can also try our [online demo](https://internvl.opengvlab.com/) for a quick hands-on experience with this model.
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer
path = "OpenGVLab/InternVL-Chat-V1-1"
# If your GPU has more than 40 GB of memory, you can put the entire model on a single GPU.
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True).eval().cuda()
# Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
# model = AutoModel.from_pretrained(
# path,
# torch_dtype=torch.bfloat16,
# low_cpu_mem_usage=True,
# trust_remote_code=True,
# device_map='auto').eval()
tokenizer = AutoTokenizer.from_pretrained(path)
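# Load the example image and resize it to the model's fixed 448 x 448 input (256 visual tokens).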
image = Image.open('./examples/image2.jpg').convert('RGB')
image = image.resize((448, 448))
image_processor = CLIPImageProcessor.from_pretrained(path)
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
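# Greedy decoding: a single beam, no sampling, up to 512 new tokens.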
generation_config = dict(
num_beams=1,
max_new_tokens=512,
do_sample=False,
)
# single-round conversation
question = "请详细描述图片"
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)
# multi-round conversation
question = "请详细描述图片"
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(question, response)
question = "请根据图片写一首诗"
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(question, response)
```
## Examples
In this update, InternVL-Chat has **improved support for Chinese and OCR**.
As you can see, although some letters of "Lynyrd Skynyrd" in the image fall outside the camera's frame and the "T" in "TOUR" is blocked, the model still recognizes the text correctly.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/-jQ8jCctx1VjkzVxzChQa.png)
This model can also conduct an in-depth analysis of AAAI's official website and identify important information on the web page.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/08W04RdT3PmzJuGwFU3--.png)
## Evaluation
See [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation) for detailed evaluation results.
## Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{chen2023internvl,
title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
journal={arXiv preprint arXiv:2312.14238},
year={2023}
}
```
## License
This project is released under the MIT license. Parts of this project contain code and models (e.g., LLaMA2) from other sources, which are subject to their respective licenses.
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
## Acknowledgement
InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!