---
license: apache-2.0
datasets:
- heegyu/kowikitext
- heegyu/kowiki-sentences
language:
- ko
- en
library_name: transformers
tags:
- pytorch
---
Experimental Repository :)
Contents will be updated without any notice. If you plan to use this repository, please pin a specific revision with a git hash.
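
A minimal way to pin a revision with transformers (the commit hash below is a placeholder, not a real revision of this repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pinning an exact commit keeps later pushes to this repo from silently
# changing your results. Replace the placeholder with a real commit hash.
REVISION = '0123abc'  # placeholder, not a real revision

tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev', revision=REVISION)
model = AutoModelForCausalLM.from_pretrained('beomi/Mistral-Ko-Inst-dev', revision=REVISION)
```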
This experiment aims to:
- Maintain the NLU capability of the Mistral-Instruct model (mistralai/Mistral-7B-Instruct-v0.1)
- Adapt a new Korean vocabulary seamlessly (a rough sketch of this step follows this list)
- Use a minimal dataset (Korean Wikipedia only)
- Use a computationally efficient method
- Let the model answer using its English knowledge and NLU capability even when the question and answer are in Korean only
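
The training recipe itself is not spelled out here, but the vocabulary-adaptation step can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the actual procedure used for this checkpoint: it assumes new Korean subwords are simply appended to the stock Mistral tokenizer and the embeddings are resized to match, and the token list shown is hypothetical.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = 'mistralai/Mistral-7B-Instruct-v0.1'
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical examples; a real run would append thousands of Korean
# subwords learned from a Korean Wikipedia corpus.
new_tokens = ['안녕하세요', '커피', '브랜드']
tokenizer.add_tokens(new_tokens)

# Resize so the new token ids get embedding rows; existing rows keep their
# weights, which helps preserve the base model's English NLU capability.
model.resize_token_embeddings(len(tokenizer))

# A compute-efficient option is to train only the new embedding rows on the
# Korean corpus instead of updating all 7B parameters.
```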
Here's a quick test:
```python
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    torch_dtype='auto',
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=350,
    return_full_text=False,
    no_repeat_ngram_size=6,
    eos_token_id=1,  # not yet tuned to generate </s>; stop on <s> instead.
)

def gen(x):
    chat = tokenizer.apply_chat_template([
        {"role": "user", "content": x},
        # {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        # {"role": "user", "content": "Do you have mayonnaise recipes? please say in Korean."}
    ], tokenize=False)
    print(pipe(chat)[0]['generated_text'].strip())
gen("μ€νλ²
μ€μ μ€νλ²
μ€ μ½λ¦¬μμ μ°¨μ΄λ?")
# (Generation example)
# 스타벅스는 전세계적으로 운영하고 있는 커피 전문사이다. 한국에는 스타벅스 코리아라는 이름으로 운영되고 있다.
# 스타벅스 코리아는 대한민국에 입점한 이후 2009년과 2010년에 두 차례의 브랜드과의 재검토 및 새로운 디자인을 통해 새로운 브랜드다. 커피 전문점 프리미엄 이미지를 유지하고 있고, 스타벅스 코리아는 한국을 대표하는 프리미엄 커피 전문 브랜드을 만들고 있다.
#
# (English gloss: "Starbucks is a coffee company operating worldwide. In Korea it
# operates under the name Starbucks Korea. After entering South Korea, Starbucks
# Korea became a new brand through brand reviews and new designs in 2009 and 2010.
# It keeps the premium coffee-shop image, and Starbucks Korea is building a premium
# coffee brand that represents Korea.")
```