---
library_name: transformers
datasets:
- benchang1110/TaiVision-pretrain-1M
language:
- zh
pipeline_tag: image-text-to-text
---

# Model Card for TaiVisionLM

## Model Details

## English
# TaiVisionLM: The First of Its Kind! 🚀

🌟 This is a small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟

✨ Built to be compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and run for lightning-fast inference without any additional libraries! ⚡️

Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖


## 繁體中文 (Traditional Chinese)
# 臺視 (TaiVision): a first-of-its-kind vision-language model! 🚀

🌟 TaiVisionLM is a small vision-language model (only 1.2B parameters) that can respond to Traditional Chinese instructions given an image input! 🌟

✨ TaiVisionLM can be loaded, fine-tuned, and used with transformers! ⚡️

Ready to experience 臺視 (TaiVision)? Let's get started! 🖼️🤖



---

### Model Description

## English
This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/google/siglip-base-patch16-224) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model. A vision projector connects the two modalities.
Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).
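
To see how these three components fit together, the sub-configurations of the model config can be inspected. The attribute names below (`vision_config`, `text_config`) are assumptions based on the PaliGemma-style layout, not confirmed names; verify them against `configuration_taivisionlm.py`.

```python
from transformers import AutoConfig

# Load the custom TaiVisionLM config (remote code required).
config = AutoConfig.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True)

# Assumed attribute names: SigLIP vision encoder and TinyLlama language model settings.
print(config.vision_config)
print(config.text_config)
```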

Here's the summary of the development process:

1) **Unimodal pretraining**
    - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) and the language model we trained ourselves, [Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
2) **Feature Alignment**
    - I train the vision projector with full parameters and fine-tune the language model with LoRA on 1M image-text pairs to align visual and textual features (a minimal sketch of this setup follows the list).
3) **Task Specific Training**
    - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering. This stage will start once the dataset is ready!
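
Below is a minimal sketch of the stage-2 (feature alignment) setup using the PEFT library. The module names (`language_model`, `multi_modal_projector`) and the Llama-style attention projection names are assumptions made for illustration; check them against `modeling_taivisionlm.py`. The LoRA rank and dropout are illustrative, not the values used in training.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model with its custom remote code.
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVision-base", trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,                      # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    # Regex keeps LoRA on the language model only, so the vision encoder stays frozen.
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
    modules_to_save=["multi_modal_projector"],  # projector is trained in full
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters and the projector are trainable
```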



- **Developed by:** [benchang1110](https://huggingface.co/benchang1110)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s) (NLP):** *Traditional Chinese*

## 繁體中文 (Traditional Chinese)
This model is a multimodal language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model; a vision projector joins the two modalities.
Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).

Here is a summary of the development process:

1) **Unimodal pretraining**
   - In this stage, I use the image encoder from [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) together with the language model we trained ourselves, [Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
2) **Feature Alignment**
   - I train the vision projector on 1M image-text pairs and fine-tune the language model's weights with LoRA.
3) **Task Specific Training**
   - The aligned model then undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering. We will run this stage once the dataset is ready!


- **Developed by:** [benchang1110](https://huggingface.co/benchang1110)
- **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- **Language(s):** *Traditional Chinese*
  
---

## How to Get Started with the Model

## English

In Transformers, you can load the model and do inference as follows:

**IMPORTANT NOTE:** The TaiVisionLM model is not yet natively integrated into the Transformers library, so you need to set ```trust_remote_code=True``` when loading it. This downloads the ```configuration_taivisionlm.py```, ```modeling_taivisionlm.py```, and ```processing_taivisionlm.py``` files from the repo. You can review the content of these files under the *Files and Versions* tab and pin specific versions if you have any concerns about malicious code.
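
As a hedged example, the whole repository (including these custom code files) can be pinned to a specific commit by passing the `revision` argument; the hash below is a placeholder:

```python
from transformers import AutoProcessor

# Pin the repo (model weights and remote code) to one commit; replace the placeholder hash.
processor = AutoProcessor.from_pretrained(
    "benchang1110/TaiVision-base",
    trust_remote_code=True,
    revision="<commit-hash>",
)
```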

```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# Load the config, processor, and model (remote code is required for now).
config = AutoConfig.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="sdpa").to('cuda')
model.eval()

# Fetch an example image and ask the model to describe it in Traditional Chinese.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"  # "Describe the image"
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to('cuda')
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```

## 中文 (Chinese)
With transformers, you can run inference using the code below:

**IMPORTANT NOTE:** TaiVisionLM (臺視) has not yet been integrated into transformers, so you need to set ```trust_remote_code=True``` when downloading the model. Loading the model uses the ```configuration_taivisionlm.py```, ```modeling_taivisionlm.py```, and ```processing_taivisionlm.py``` files; if you are concerned about malicious code, please check the content of these files under the *Files and Versions* tab first.

```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# Load the config, processor, and model (remote code is required for now).
config = AutoConfig.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("benchang1110/TaiVision-base", trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="sdpa").to('cuda')
model.eval()

# Fetch an example image and ask the model to describe it in Traditional Chinese.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"  # "Describe the image"
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to('cuda')
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```

### Training Procedure

The following training hyperparameters are used in the feature-alignment and task-specific training stages, respectively:

- **Feature Alignment**

| Data size (image-text pairs) | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
|------------------------------|-------------------|---------------|--------|------------|--------------|
| 1M                           | 16                | 5e-5          | 1      | 2048       | 1e-5         |

We use full-parameter finetuning for the projector and apply LoRA to the language model.
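
For illustration only, the table above would map onto the Hugging Face `Trainer` roughly as follows; the per-device batch size / gradient accumulation split, output directory, precision, and logging settings are assumptions, and the max length of 2048 is enforced by the processor/collator rather than here:

```python
from transformers import TrainingArguments

# Sketch of the feature-alignment run; only global batch size 16, lr 5e-5,
# 1 epoch, and weight decay 1e-5 come from the table above.
training_args = TrainingArguments(
    output_dir="taivisionlm-feature-alignment",  # assumed name
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,               # 4 x 4 = global batch size 16
    learning_rate=5e-5,
    weight_decay=1e-5,
    num_train_epochs=1,
    fp16=True,                                   # assumption for a 32 GB V100
    logging_steps=100,
    save_strategy="epoch",
)
```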

### Compute Infrastructure
- **Feature Alignment**
  - 1x V100 (32 GB), approximately 16 GPU hours.