metadata

license: apache-2.0
language:
  - ja
  - en
pipeline_tag: image-to-text

Model Description

llava-calm2-siglip is an experimental Vision Language Model that can answer questions in Japanese about images.

Usage

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained(
    "cyberagent/llava-calm2-siglip",
    torch_dtype=torch.bfloat16,
).to(0)

processor = AutoProcessor.from_pretrained("cyberagent/llava-calm2-siglip")

prompt = """USER: <image>
この画像を説明してください。
ASSISTANT: """

url = "https://unsplash.com/photos/LipkIP4fXbM/download?force=true&w=640"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(text=prompt, images=image, return_tensors="pt").to(0, torch.bfloat16)
generate_ids = model.generate(
    **inputs,
    max_length=500,
    do_sample=True,
    temperature=0.2,
)
# Drop the final token (normally the end-of-sequence token) before decoding.
output = processor.tokenizer.decode(generate_ids[0][:-1], clean_up_tokenization_spaces=False)

print(output)

# USER: <image>
# この画像を説明してください。
# ASSISTANT: 画像には、木製のテーブルの上に置かれた、たこ焼き器で焼かれた3つのたこ焼きが映っています。たこ焼きは、小麦粉をベースにした生地を丸く焼き、中にタコや天かす、紅ショウガなどの具材を入れたものです。たこ焼きは、ソース、マヨネーズ、青海苔、かつおぶしをかけて食べることが多いです。
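
The decoded string contains the prompt as well as the reply, as the sample output above shows. The following is a minimal sketch for keeping only the model's answer, assuming the reply is whatever follows the last `ASSISTANT:` marker. The helper name `extract_reply` is hypothetical and not part of the model or of transformers.

```python
def extract_reply(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last occurrence of `marker` in `decoded`.

    Hypothetical helper: if the marker is missing, the whole string is returned.
    """
    # rpartition splits on the LAST marker; when the marker is absent the
    # first two parts are empty and the third part is the full input.
    _, _, reply = decoded.rpartition(marker)
    return reply.strip()


# Shortened stand-in for the decoded output shown above.
sample = "USER: <image>\nこの画像を説明してください。\nASSISTANT: たこ焼きが映っています。"
print(extract_reply(sample))
```

Splitting on the last marker rather than the first keeps the sketch usable for multi-turn prompts, where earlier `ASSISTANT:` turns are part of the input.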

Chat Template

USER: <image>
{user_message1}
ASSISTANT: {assistant_message1}<|endoftext|>
USER: {user_message2}
ASSISTANT: {assistant_message2}<|endoftext|>
USER: {user_message3}
ASSISTANT: {assistant_message3}<|endoftext|>
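
A minimal sketch of how a prompt string could be assembled from this template. The function name `build_prompt` and the `(user, assistant)` tuple format are assumptions made for illustration. The newline placed after each `<|endoftext|>` follows the line breaks in the template as displayed and may differ from the exact training format.

```python
def build_prompt(turns, eos="<|endoftext|>"):
    """Build a prompt from (user_message, assistant_message) pairs.

    Hypothetical helper. Pass None as the assistant message of the final
    turn to leave the prompt open for generation.
    """
    parts = []
    for i, (user, assistant) in enumerate(turns):
        # The <image> placeholder appears only in the first user turn.
        image_tag = "<image>\n" if i == 0 else ""
        parts.append(f"USER: {image_tag}{user}\n")
        if assistant is None:
            # Open turn: the model continues from here.
            parts.append("ASSISTANT: ")
            break
        parts.append(f"ASSISTANT: {assistant}{eos}\n")
    return "".join(parts)


single_turn = build_prompt([("この画像を説明してください。", None)])
print(single_turn)
```

For a single open turn this reproduces the prompt string used in the Usage section.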

Model Details

Model size: 7B
Model type: Transformer-based Vision Language Model
Language(s): Japanese, English
Developed by: CyberAgent, Inc.
License: Apache-2.0

Training

This model is a vision-language instruction-following model based on LLaVA 1.5. It uses cyberagent/calm2-7b-chat as its language model and google/siglip-so400m-patch14-384 as its image encoder. Training had two stages: the first trained the MLP projection from scratch, and the second further trained both the language model and the MLP projection.
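
The two-stage schedule can be sketched as a rule for which parameters receive gradients in each stage. This is an illustration, not the authors' training code. The component names `multi_modal_projector`, `language_model`, and `vision_tower` follow the attribute names of Hugging Face's `LlavaForConditionalGeneration`. Keeping the image encoder frozen in both stages is an assumption based on the LLaVA 1.5 recipe, since the description above only says which parts were trained.

```python
# Components trained in each stage, per the description above.
# Assumption: anything not listed (e.g. the vision tower) stays frozen.
STAGE_TRAINABLE = {
    1: ("multi_modal_projector",),
    2: ("multi_modal_projector", "language_model"),
}


def is_trainable(param_name: str, stage: int) -> bool:
    """Return True if a parameter with this dotted name is trained in `stage`."""
    components = param_name.split(".")
    return any(name in components for name in STAGE_TRAINABLE[stage])


# Illustrative parameter names in the style of a LLaVA checkpoint.
names = [
    "vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight",
    "multi_modal_projector.linear_1.weight",
    "language_model.model.layers.0.self_attn.q_proj.weight",
]
for stage in (1, 2):
    print(stage, [n for n in names if is_trainable(n, stage)])
```

With a loaded model, the same rule could be applied by iterating over `model.named_parameters()` and setting `requires_grad` from `is_trainable(name, stage)`.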

Dataset for Visual Instruction Tuning

In the second stage of Visual Instruction Tuning, we train on a dataset of conversations about images. This conversational data was generated using our in-house large-scale Japanese language model, based on images, captions, object labels, and bounding boxes from MS-COCO and Visual Genome. For methods of generating conversational datasets for Visual Instruction Tuning without using images, please refer to LLaVA 1.5.

Evaluation Results

LLaVA Bench In-the-wild

Model	Detail	Conv	Complex	Average
llava-calm2-siglip	51.20	55.90	65.51	57.54
Japanese Stable VLM	26.02	24.84	29.18	26.68
SakanaAI EvoVLM-JP	49.59	65.49	54.22	56.43
Heron BLIP v1 (620k)	45.45	32.90	56.89	45.08
Heron GIT	40.98	39.87	54.59	45.15

LLaVA Bench In-the-wild translated into Japanese.
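
The card does not state how the Average column is computed. The reported values are consistent with an unweighted mean of the Detail, Conv, and Complex scores, which the following sketch checks against the table above. The variable and function names are illustrative.

```python
# (Detail, Conv, Complex, reported Average), copied from the table above.
rows = {
    "llava-calm2-siglip": (51.2, 55.9, 65.51, 57.54),
    "Japanese Stable VLM": (26.02, 24.84, 29.18, 26.68),
    "SakanaAI EvoVLM-JP": (49.59, 65.49, 54.22, 56.43),
    "Heron BLIP v1 (620k)": (45.45, 32.90, 56.89, 45.08),
    "Heron GIT": (40.98, 39.87, 54.59, 45.15),
}


def category_mean(detail: float, conv: float, complex_score: float) -> float:
    """Unweighted mean of the three category scores."""
    return (detail + conv + complex_score) / 3


for model, (detail, conv, complex_score, reported) in rows.items():
    computed = category_mean(detail, conv, complex_score)
    print(f"{model}: computed {computed:.2f}, reported {reported:.2f}")
```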

Heron-Bench

Model	Detail	Conv	Complex	Average
llava-calm2-siglip	53.42	50.13	52.72	52.09
Japanese Stable VLM	25.15	51.23	37.84	38.07
SakanaAI EvoVLM-JP	50.31	44.42	40.47	45.07
Heron BLIP v1 (620k)	49.09	41.51	45.72	45.44
Heron GIT	42.77	54.20	43.53	46.83


Use and Limitations

Intended Use

This model is designed for use by the open-source community in vision-language applications and academic research.

Limitations and Biases

As a general-purpose Japanese VLM, this model performs best when fine-tuned on data relevant to each task. Commercial use is possible but should be approached with caution, and we strongly recommend implementing mechanisms to filter out inappropriate content when deploying the model in production systems. The model should not be used in applications that could harm or distress individuals or groups. CyberAgent expressly disclaims any liability for direct, indirect, special, incidental, or consequential damages, as well as for any losses that may result from using this model, regardless of the outcomes. Users must fully understand these limitations before employing the model.

Author

Aozora Inagaki