Pixtral-12b-korean-preview

Finetunned with korean, english data for improving korean performance.

Model Card for Model ID

Merged model using mergekit

This model hasn't been fully tested, so your feedback will be invaluable in improving it.

Merge Format

models:
  - model: spow12/Pixtral-12b-korean-base(private)
    layer_range: [0, 40]
  - model: mistral-community/pixtral-12b
    layer_range: [0, 40]
merge_method: slerp
base_model: mistral-community/pixtral-12b
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5 # fallback for rest of tensors
dtype: bfloat16

Model Details

Model Description

Developed by: spow12(yw_nam)
Shared by : spow12(yw_nam)
Model type: LLaVA
Language(s) (NLP): Korean, English
Finetuned from model : mistral-community/pixtral-12b

Usage

Single image inference

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id =  'spow12/Pixtral-12b-korean-preview'
model = AutoModelForVision2Seq.from_pretrained(
    model_id, 
    device_map='auto', 
    torch_dtype = torch.bfloat16, 
).eval()
model.tie_weights()
processor = AutoProcessor.from_pretrained(model_id)

system = "You are helpful assistant create by Yw nam"


chat = [
    {
        'content': system,
        'role': 'system'
    },
    {
        "role": "user", "content": [
        {"type": "image"},  
        {"type": "text", "content": "이 이미지에 나와있는 풍경을 설명해줘"}, 
        ]
    }
]
url = "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSXVmCeFm5GRrciuGCM502uv9xXVSrS9zDJZ1umCfoMero2MLxT"
image = Image.open(requests.get(url, stream=True).raw)

images = [[image]]
prompt = processor.apply_chat_template(chat, tokenize=False)

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500,do_sample=True,min_p=0.1, temperature=0.9)
output = processor.batch_decode(generate_ids, skip_special_tokens=True,clean_up_tokenization_spaces=False)
print(output[0])

#Output
"""이 이미지는 바위 해안에 위치한 작은 섬에 위치한 고요한 해안 경치를 보여줍니다. 이 섬은 푸른 물로 둘러싸여 있으며, 그 위에는 붉은 지붕이 있는 하얀 등대가 서 있습니다. 등대는 섬의 중앙에 위치해 있으며, 바위 절벽과 연결된 돌다리가 이어져 있어 접근할 수 있습니다. 등대 주변의 바위 절벽은 파도가 부딪히며 장면에 역동적인 요소를 더합니다. 등대 너머로는 하늘이 맑고 푸르며, 전체적인 장면은 평화롭고 고요한 분위기를 자아냅니다."""

Multi image inference

url_apple = "https://cloud.shopback.com/c_fit,h_750,w_750/store-service-tw/assets/20185/0476e480-b6c3-11ea-b541-2ba549204a69.png"
image_1 = Image.open(requests.get(url_apple, stream=True).raw)
url_microsoft = "https://pbs.twimg.com/profile_images/1268196215587397634/sgD5ZWuO_400x400.png"
image_2 = Image.open(requests.get(url_microsoft, stream=True).raw)
chat = [
    {
        'content': system,
        'role': 'system'
    },
    {
        "role": "user", "content": [
        {"type": "image"},  
        {"type": "image"},  
        {"type": "text", "content": "두 기업에 대해서 아는걸 설명해줘."}, 
        ]
    }
]

images = [[image_1, image_2] ]
prompt = processor.apply_chat_template(chat, tokenize=False)
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, min_p=0.1)
output = processor.batch_decode(generate_ids, skip_special_tokens=True,clean_up_tokenization_spaces=False)
print(output[0])


#Output
"""두 기업은 각각 Apple과 Microsoft입니다.

1. 애플:
애플은 1976년에 스티브 잡스, 스티브 워즈니악, 로널드 웨인에게 설립된 미국의 다국적 기술 기업입니다. 애플의 주요 제품으로는 iPhone, iPad, Mac, Apple Watch가 있습니다. 이 회사는 혁신적인 디자인, 사용자 친화적인 인터페이스, 고품질의 하드웨어로 유명합니다. 애플은 또한 Apple Music, iCloud, App Store와 같은 다양한 소프트웨어 서비스와 플랫폼을 제공합니다. 애플은 혁신적인 제품과 강력한 브랜드로 잘 알려져 있으며, 2010년대 이후 세계에서 가장 가치 있는 기업 중 하나로 자리매김했습니다.

2. 마이크로소프트:
마이크로소프트는 1975년에 빌 게이츠와 폴 알렌에 의해 설립된 미국의 다국적 기술 기업입니다. 이 회사는 운영 체제, 소프트웨어, 개인용 컴퓨터, 전자제품 개발에 중점을 둡니다. 마이크로소프트의 주요 제품으로는 Windows 운영 체제, Microsoft Office 제품군, Xbox 게임 콘솔이 있습니다. 이 회사는 소프트웨어 개발, 클라우드 컴퓨팅, 인공지능 연구와 같은 분야에서도 중요한 역할을 하고 있습니다. 마이크로소프트는 혁신적인 기술과 강력한 비즈니스 솔루션으로 잘 알려져 있으며, 세계에서 가장 가치 있는 기업 중 하나로 자리매김했습니다"""

Limitation

Overall, the performance seems reasonable.

However, it declines when processing images with non enlgish image.

This is likely because the model was trained primarily on English text and landscapes.

Adding Korean data in the future is expected to enhance performance.

Citation

@misc {spow12/Pixtral-12b-korean-preview,
    author       = { YoungWoo Nam },
    title        = { spow12/Pixtral-12b-korean-preview },
    year         = 2024,
    url          = { https://huggingface.co/spow12/Pixtral-12b-korean-preview },
    publisher    = { Hugging Face }
}

spow12
/

Pixtral-12b-korean-preview