benchang1110 committed on
Commit
cdf3527
1 Parent(s): 1e70f2b

Update README.md

Files changed (1)
  1. README.md +22 -22
README.md CHANGED
@@ -9,18 +9,15 @@ pipeline_tag: image-text-to-text
 
 # Model Card for Model ID
 
- <!-- Provide a quick summary of what the model is/does. -->
-
-
 
 ## Model Details
 
 ## English
 # TaiVisionLM: The First of Its Kind! 🚀
 
- 🌟 This is a very fast and small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟
 
- ✨ Developed compatible with the Transformers library, TaiVisionLM is a breeze to load, fine-tune, and use for lightning-fast inferences—all without needing any external libraries! ⚡️
 
 Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖
 
@@ -28,32 +25,38 @@ Ready to experience the Traditional Chinese visual language model? Let's go!
 ## Traditional Chinese
 # 臺視: 首創獨一無二的視覺語言模型!! 🚀
 
- 🌟 TaiVisionLM 是一個非常快速且小巧的視覺語言模型(僅有 12 億參數),在 Hugging Face 上可以根據圖像輸入來回應繁體中文指令!🌟
 
- ✨ TaiVisionLM Transformers 完全相容,易於載入、微調和使用,用於快速推理——不需要任何外部庫!⚡️
 
- 準備好體驗這個繁體中文視覺語言模型了嗎?讓我們開始吧!🖼️🤖
 
 
 
 ---
 
- # Model Details
 
 ## English
- This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model. The vision projector connects the two modalities together.
 Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).
 
 Here's the summary of the development process:
 
 1) **Unimodal pretraining**
- - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) and the language model trained by ourselves (https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
 2) **Feature Alignment**
- - Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), I train the vision projector using 1B image-text pairs to align visual and textual features.
 3) **Task Specific Training**
- - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
 We will undergo this stage after the dataset is ready!
 
 ## 中文
 這個模型是一個多模態的語言模型,結合了 [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) 作為其視覺編碼器,並使用 [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) 作為語言模型。視覺投影器將這兩種模態結合在一起。
 其架構與 [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma) 非常相似。
@@ -63,18 +66,15 @@ Here's the summary of the development process:
 1) **單模態預訓練**
 - 在這個階段,我利用了 [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) 的圖像編碼器,以及我們自己訓練的語言模型([Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat))。
 2) **特徵對齊**
- - 根據 [LLaVA](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train),我使用 10 億個圖文配對數據來訓練視覺投影器,以對齊視覺和文本特徵。
 3) **任務特定訓練**
- - 對齊後的模型進行進一步的訓練,針對短描述、詳細描述和簡單視覺問答等任務,使用超過 100 萬組圖像-提示-完成三元組數據進行訓練。我們將在數據集準備好後進行這一階段的訓練!
-
- ### Model Description
- <!-- Provide a longer summary of what this model is. -->
-
- - **Developed by:** [benchang1110](https://huggingface.co/benchang1110)
- - **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- - **Language(s) (NLP):** *Traditional Chinese*
 
 
 ---
 
 ## How to Get Started with the Model
 
 
 # Model Card for Model ID
 
 
 ## Model Details
 
 ## English
 # TaiVisionLM: The First of Its Kind! 🚀
 
+ 🌟 This is a small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟
 
+ ✨ Developed to be compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and run for lightning-fast inference, with no external libraries needed! ⚡️
 
 Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖
 
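Since the card advertises plain-Transformers loading, a minimal quick-start sketch follows. It assumes the checkpoint ships custom code that registers with the `image-text-to-text` pipeline in a recent `transformers` release; the repo id and image path are placeholders, not values taken from this card.

```python
# Quick-start sketch (assumptions: the repo's custom code works with the
# image-text-to-text pipeline; "<repo-id>" is a placeholder for this model).
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="<repo-id>",
    trust_remote_code=True,
)
print(pipe(images="example.jpg", text="描述這張圖片", max_new_tokens=64))
```
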
 ## Traditional Chinese
 # 臺視: 首創獨一無二的視覺語言模型!! 🚀
 
+ 🌟 TaiVisionLM 是一個小型的視覺語言模型(僅有 12 億參數),可以根據圖像輸入來回覆繁體中文指令!🌟
 
+ ✨ TaiVisionLM 可以用 transformers 載入、微調和使用!⚡️
 
+ 準備好體驗「臺視」了嗎?讓我們開始吧!🖼️🤖
 
 
 
 ---
 
+ ### Model Description
 
 ## English
+ This model is a multimodal large language model that combines [SigLIP](https://huggingface.co/google/siglip-base-patch16-224) as its vision encoder with [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) as its language model. The vision projector connects the two modalities together.
 Its architecture closely resembles [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma).
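
To make the wiring concrete, here is an illustrative sketch of the projector idea described above. The class name and hidden sizes are assumptions for illustration only (SigLIP-base patch features are 768-dimensional and TinyLlama's hidden size is 2048); treat the checkpoint's config as authoritative.

```python
# Architecture sketch (illustrative only): a linear projector maps SigLIP patch
# features into the TinyLlama embedding space, PaliGemma-style. Sizes below are
# assumptions, not read from the model's actual config.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 768, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.proj(patch_features)

def build_inputs(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # Projected image tokens are prepended to the text embeddings before the LM.
    return torch.cat([image_embeds, text_embeds], dim=1)
```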
 
 Here's the summary of the development process:
 
 1) **Unimodal pretraining**
+ - In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224) and the language model we trained ourselves, [Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
 2) **Feature Alignment**
+ - I train the vision projector and fine-tune the language model with LoRA on 1B image-text pairs to align visual and textual features (see the sketch after this list).
 3) **Task Specific Training**
+ - The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering.
 We will undergo this stage after the dataset is ready!
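
A rough sketch of how the feature-alignment stage could be set up, reusing the `VisionProjector` from the sketch above. The LoRA hyperparameters and optimizer settings are illustrative assumptions, not the card's actual training script.

```python
# Feature-alignment sketch (assumptions: illustrative hyperparameters; only the
# projector and the LoRA adapters are trained, the SigLIP encoder stays frozen).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

language_model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/Taiwan-tinyllama-v1.0-chat", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names
    task_type="CAUSAL_LM",
)
language_model = get_peft_model(language_model, lora_config)

projector = VisionProjector()  # from the architecture sketch above
optimizer = torch.optim.AdamW(
    list(projector.parameters())
    + [p for p in language_model.parameters() if p.requires_grad],
    lr=1e-4,
)
```

In this setup only the projector weights and the LoRA adapters receive gradients; the base language-model and vision-encoder weights stay frozen.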
 
+
+
+ - **Developed by:** [benchang1110](https://huggingface.co/benchang1110)
+ - **Model type:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
+ - **Language(s) (NLP):** *Traditional Chinese*
+
 ## 中文
 這個模型是一個多模態的語言模型,結合了 [SigLIP](https://huggingface.co/docs/transformers/en/model_doc/siglip) 作為其視覺編碼器,並使用 [Tinyllama](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat) 作為語言模型。視覺投影器將這兩種模態結合在一起。
 其架構與 [PaliGemma](https://huggingface.co/docs/transformers/v4.44.0/model_doc/paligemma) 非常相似。
 
 1) **單模態預訓練**
 - 在這個階段,我利用了 [google/siglip-base-patch16-224-multilingual](https://huggingface.co/google/siglip-base-patch16-224-multilingual) 的圖像編碼器,以及我們自己訓練的語言模型([Taiwan-tinyllama-v1.0-chat](https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat))。
 2) **特徵對齊**
+ - 我使用了 10 億個圖片和文本的配對來訓練圖像投影器 (visual projector),並使用 LoRA 來微調語言模型的權重。
 3) **任務特定訓練**
+ - 對齊後的模型將進行進一步的訓練,針對短描述、詳細描述和簡單視覺問答等任務。我們將在數據集準備好後進行這一階段的訓練!
 
 
+ - **創作者:** [benchang1110](https://huggingface.co/benchang1110)
+ - **模型類型:** [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
+ - **語言:** *Traditional Chinese*
+
 ---
 
 ## How to Get Started with the Model
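
A minimal inference sketch for this section, assuming the checkpoint ships a custom processor and model class loadable with `trust_remote_code=True` (as similar custom VLMs on the Hub do); the repo id and image URL are placeholders, not values from this card.

```python
# Inference sketch (assumptions: custom processor/model via trust_remote_code;
# "<repo-id>" and the image URL are placeholders, not values from this card).
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "<repo-id>"
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
inputs = processor(text="描述這張圖片", images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```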