---
library_name: transformers
datasets:
  - benchang1110/TaiVision-pretrain-1M-v2.0
language:
  - zh
pipeline_tag: image-text-to-text
---

Model Card for TaiVisionLM-base-v2

TaiVisionLM

Model Details

English

TaiVisionLM: The First of Its Kind! 🚀

🌟 This is a small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input! 🌟

✨ Developed to be compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and run for lightning-fast inference without any external libraries! ⚡️

Ready to experience the Traditional Chinese visual language model? Let's go! 🖼️🤖

繁體中文 (Traditional Chinese)

台視 (TaiVision): a vision language model for Taiwan! 🚀

🌟 TaiVisionLM is a small vision language model (only 1.2B parameters) that can answer Traditional Chinese instructions about an input image! 🌟

✨ TaiVisionLM can be loaded, fine-tuned, and used with transformers! ⚡️

Ready to try "台視"? Let's get started! 🖼️🤖


Model Description

English

This model is a multimodal large language model that combines SigLIP as its vision encoder with TinyLlama as its language model. A vision projector connects the two modalities. Its architecture closely resembles PaliGemma.
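As a rough sketch of this design (not the repository's actual implementation; the module names, hidden sizes, and single-layer projector below are assumptions), the projector maps SigLIP patch embeddings into the language model's embedding space so that the projected image tokens can be prepended to the text embeddings:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Hypothetical single-layer projector: SigLIP hidden size -> TinyLlama hidden size."""
    def __init__(self, vision_dim: int = 768, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vision_dim) from the SigLIP encoder
        return self.proj(patch_embeds)  # (batch, num_patches, lm_dim)

# How the pieces fit together at a high level (pseudo-usage):
#   image_embeds  = siglip(pixel_values)                     # vision encoder
#   image_tokens  = projector(image_embeds)                  # projected "soft tokens"
#   text_tokens   = lm.get_input_embeddings()(input_ids)     # text embeddings
#   inputs_embeds = torch.cat([image_tokens, text_tokens], 1)
#   logits        = lm(inputs_embeds=inputs_embeds).logits   # PaliGemma-style fusion
```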

Here's the summary of the development process:

  1. Unimodal pretraining

  2. Feature alignment

    • We trained the vision projector and fine-tuned the language model with LoRA on 1M image-text pairs to align visual and textual features (a hypothetical LoRA setup is sketched after this list). This model is a fine-tuned version of benchang1110/TaiVisionLM-base-v1, further trained on 1M image-text pairs, and it generates longer and more detailed image descriptions.

  3. Task-specific training

    • The aligned model undergoes further training on tasks such as short captioning, detailed captioning, and simple visual question answering. We will run this stage once the dataset is ready!
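To make the feature-alignment step concrete, here is a minimal, hypothetical sketch of applying LoRA to a TinyLlama-style language model with the peft library; the base checkpoint, rank, and target modules below are illustrative assumptions, not the exact recipe used for this model:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative stand-in for the language-model backbone (not necessarily the one used here).
lm = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Hypothetical LoRA configuration for the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
lm = get_peft_model(lm, lora_config)
lm.print_trainable_parameters()  # only the LoRA adapters (plus, in TaiVisionLM, the projector) train
```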

繁體中文

這個模型是一個多模態的語言模型,結合了 SigLIP 作為其視覺編碼器,並使用 Tinyllama 作為語言模型。視覺投影器將這兩種模態結合在一起。
其架構與 PaliGemma 非常相似。

以下是開發過程的摘要:

  1. 單模態預訓練
  2. 特徵對齊
    • 我們使用了100萬個圖片和文本的配對來訓練圖像投影器 (visual projector),並使用 LoRA 來微調語言模型的權重。 這個模型是 benchang1110/TaiVisionLM-base-v1 的微調版本。我們使用了100萬個圖片和文本的配對來微調模型。微調後的模型將生成更長、更詳細的圖片描述。
  3. 任務特定訓練
    • 對齊後的模型將進行進一步的訓練,針對短描述、詳細描述和簡單視覺問答等任務。我們將在數據集準備好後進行這一階段的訓練!

How to Get Started with the Model

English

In Transformers, you can load the model and do inference as follows:

IMPORTANT NOTE: The TaiVisionLM model is not yet natively integrated into the Transformers library, so you need to set trust_remote_code=True when loading it. This will download configuration_taivisionlm.py, modeling_taivisionlm.py, and processing_taivisionlm.py from the repo. You can inspect these files under the Files and versions tab and pin a specific revision if you have any concerns about malicious code.

```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# trust_remote_code=True is needed because the model/processor classes live in this repo.
config = AutoConfig.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="sdpa").to('cuda')
model.eval()

# Fetch an example image and write the prompt.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"  # "Describe the image"

# Preprocess, generate, and decode the response.
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to('cuda')
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```
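If you want to pin the downloaded files to a specific commit, as the note above suggests, from_pretrained accepts a revision argument (recent transformers releases also expose code_revision for the dynamically loaded code). The hash below is a placeholder, not an actual commit of this repo:

```python
# Pin the repository files to one commit for reproducibility.
# "0123abc" is a placeholder: copy a real commit hash from the repo's history.
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base-v2",
    revision="0123abc",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to('cuda')
```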

中文 (Chinese)

With transformers, you can run inference using the code below:

IMPORTANT NOTE: TaiVisionLM (台視) has not yet been integrated into transformers, so you must pass trust_remote_code=True when downloading the model. Loading the model uses the three files configuration_taivisionlm.py, modeling_taivisionlm.py, and processing_taivisionlm.py; if you are concerned about malicious code, please review their contents under the Files and versions tab first.

```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# trust_remote_code=True is needed because the model/processor classes live in this repo.
config = AutoConfig.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="sdpa").to('cuda')
model.eval()

# Fetch an example image and write the prompt.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"  # "Describe the image"

# Preprocess, generate, and decode the response.
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to('cuda')
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```

Comparison with the prior model (benchang1110/TaiVisionLM-base-v1)

  • Example 1 (image: smile)
  • TaiVisionLM-base-v1:
    卡通插圖描繪掛在家門口的標誌,上下方以卡通插圖的方式呈現。
  • TaiVisionLM-base-v2:
    這張圖描繪了一個單詞「SMILE」經典的卡通字體。該字表面是黑白的主要色彩調色板。詞以貫穿其身體的光滑線條字體書寫。該字具有模糊的質感,與單詞形成平滑而簡約的視覺效果。
    字母「「SMILE」」自豪地表示。顯眼的文字是圖片的焦點,吸引觀眾的注意力到其具有簡潔性的方式。該字在白色背景上顯眼地展示,與黑色字體形成鮮明對比。
    圖片中沒有其他物品或文字。字和底部的文字並沒有提供有關詞「「SMILE」具體含義的任何其他背景信息。然而,詞「「SMILE」」的整體設計使其成為這張影像中的焦點,吸引了注意力到其獨特形狀。圖片中沒有其他物品或文字。
  • Example 2 (image: paris)
  • TaiVisionLM-base-v1:
    這是一幅攝影作品,展示了巴黎的鐵塔被水景所環繞
  • TaiVisionLM-base-v2: 這張照片捕捉了巴黎,法國標誌性的塔樓和人行道景觀的令人驚嘆的景象。塔樓高聳在清澈的藍天沿著舊有大路的背景之上。它是一座高聳入雲的圓頂金屬圖案,高度被分數精確錯量。塔樓由金屬和石頭結構組成,其統一的形狀證明了其歷史意義。
    塔樓東面延伸的人行道向遠處延伸,邀請路人探索它所有的美麗。這條人行道上排列著樹木,它們翠綠的葉片與藍天形成鮮明的對比。它們的存在為場景增添了一抹綠意,為都市景觀增添了一抹自然元素。
    背景中可以看到巴黎城市景觀。各種大小和設計的建築物可以看到,它們矗立在背景中,它們的建築藝術被塔樓和人行道的視野所突顯。天空是一個清澈的藍色,它延伸到遠方,沒有任何雲彩的陰影。

這張照片是巴黎豐富歷史和現代性的一個見證。塔樓和人行道標誌著這座經典都市的地標,高聳主權人偶的高度及其證據這座城市獨特的信仰。橫跨整張照片的人行道禮貌地介紹了城市的繁忙路線。

Training Procedure

Since we do not have enough resources to train the model on the whole dataset, we used only 250k image-text pairs for training. The following training hyperparameters are used in the feature-alignment and task-specific training stages, respectively:

  • Feature Alignment

| Data size | Global batch size | Learning rate | Epochs | Max length | Weight decay |
|-----------|-------------------|---------------|--------|------------|--------------|
| 250k | 2 | 5e-5 | 1 | 2048 | 1e-5 |

We use full-parameter finetuning for the projector and apply LoRA to the language model.
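As a hedged illustration of how the table above could be expressed with the Trainer API (the author's actual training script is not published here; the scheduler, logging, and precision settings below are assumptions):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the feature-alignment hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="taivisionlm-feature-alignment",
    per_device_train_batch_size=2,   # global batch size 2 on a single GPU
    learning_rate=5e-5,
    num_train_epochs=1,
    weight_decay=1e-5,
    fp16=True,                       # V100 supports fp16 but not bf16
    logging_steps=50,
    save_strategy="epoch",
    remove_unused_columns=False,     # keep image inputs that the data collator needs
)
# The max length of 2048 would be enforced when tokenizing the captions,
# e.g. tokenizer(..., truncation=True, max_length=2048), not in TrainingArguments.
```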

We will update the training procedure once we have more resources to train the model on the whole dataset.

[Figure: training metric]

Compute Infrastructure

  • Feature alignment: 1x V100 (32 GB), approximately 12 GPU hours.