---
license: apache-2.0
datasets:
  - heegyu/kowikitext
  - heegyu/kowiki-sentences
language:
  - ko
  - en
library_name: transformers
tags:
  - pytorch
---

Experimental Repository :)

Contents will be updated without any notice. If you plan to use this repository, please pin a specific revision by its git hash.
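
For example, a minimal sketch of pinning to an exact commit with the `revision` argument of `from_pretrained` (the hash below is just one commit from this repository's history; substitute the revision you actually validated):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin every download to an exact commit so silent updates to this
# repository cannot change behavior under you.
REVISION = "333a4c3"  # example commit; use the hash you tested against

model = AutoModelForCausalLM.from_pretrained(
    "beomi/Mistral-Ko-Inst-dev", revision=REVISION
)
tokenizer = AutoTokenizer.from_pretrained(
    "beomi/Mistral-Ko-Inst-dev", revision=REVISION
)
```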

This experiment aims to:

- Maintain the NLU capability of the Mistral-Instruct model (mistralai/Mistral-7B-Instruct-v0.1)
- Adapt a new Korean vocabulary seamlessly (see the sketch after this list)
- Use a minimal dataset (Korean Wikipedia only)
- Use a computationally efficient method
- Let the model answer using its English knowledge and NLU capability, even when the question and answer are in Korean only.
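
To make the vocab-adaptation goal concrete, here is a minimal sketch of the general recipe (extend the tokenizer, then resize the embeddings). This is illustrative only, not the exact procedure behind this checkpoint; the example tokens are hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: extend the tokenizer with new Korean tokens, then
# resize the embedding matrix so the model can learn them. The actual
# token inventory and training procedure for this repo are not shown here.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

new_korean_tokens = ["μ•ˆλ…•ν•˜μ„Έμš”", "λŒ€ν•œλ―Όκ΅­"]  # hypothetical example tokens
num_added = tokenizer.add_tokens(new_korean_tokens)

# New rows in the embedding matrix are randomly initialized and would then
# be trained (e.g., on Korean Wikipedia) while keeping the rest of the
# model largely intact.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size: {len(tokenizer)}")
```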

Here is a quick test:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    torch_dtype='auto',
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=350,
    return_full_text=False,
    no_repeat_ngram_size=6,
    eos_token_id=1,  # not yet tuned to generate </s>; stop on <s> (id 1) instead.
)


def gen(x):
    # Format the prompt with the model's chat template, then generate.
    chat = tokenizer.apply_chat_template([
        {"role": "user", "content": x},
        # {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        # {"role": "user", "content": "Do you have mayonnaise recipes? please say in Korean."}
    ], tokenize=False)
    print(pipe(chat)[0]['generated_text'].strip())


# "What is the difference between Starbucks and Starbucks Korea?"
gen("μŠ€νƒ€λ²…μŠ€μ™€ μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„μ˜ μ°¨μ΄λŠ”?")

# (sample generation)
# μŠ€νƒ€λ²…μŠ€λŠ” μ „ μ„Έκ³„μ μœΌλ‘œ μš΄μ˜ν•˜κ³  μžˆλŠ” 컀피 전문사이닀. ν•œκ΅­μ—λŠ” μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λΌλŠ” μ΄λ¦„μœΌλ‘œ 운영되고 μžˆλ‹€.
# μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λŠ” λŒ€ν•œλ―Όκ΅­μ— μž…μ ν•œ 이후 2009λ…„κ³Ό 2010년에 두 μ°¨λ‘€μ˜ λΈŒλžœλ“œκ³Όμ˜ μž¬κ²€ν†  및 μƒˆλ‘œμš΄ λ””μžμΈμ„ 톡해 μƒˆλ‘œμš΄ λΈŒλžœλ“œλ‹€. 컀피 μ „λ¬Έμ˜ 프리미엄 이미지λ₯Ό μœ μ§€ν•˜κ³  있고, μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λŠ” ν•œκ΅­μ„ λŒ€ν‘œν•˜λŠ” 프리미엄 컀피 μ „λ¬Έ λΈŒλžœλ“œμ„ λ§Œλ“€κ³  μžˆλ‹€.
# (rough English gloss of the raw output above: "Starbucks is a coffee company operating
# worldwide. In Korea it operates under the name Starbucks Korea. Since entering Korea,
# Starbucks Korea went through two rounds of brand review and a new design in 2009 and 2010,
# becoming a new brand. It maintains a premium specialty-coffee image, and Starbucks Korea
# is building a premium coffee brand representing Korea.")
```
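
A note on `eos_token_id=1` in the pipeline above: in the stock Mistral tokenizer, id 1 is `<s>` (BOS) and id 2 is `</s>` (EOS). Because this checkpoint is not yet tuned to emit `</s>`, generation is stopped on `<s>` instead. You can confirm the ids against this repo's tokenizer (the expected values below are assumptions carried over from the base model):

```python
# Expected values are assumptions from the base Mistral tokenizer;
# verify against the extended tokenizer shipped in this repo.
print(tokenizer.convert_tokens_to_ids("<s>"))   # expected: 1
print(tokenizer.convert_tokens_to_ids("</s>"))  # expected: 2
print(tokenizer.eos_token_id)                   # default EOS id used by generate()
```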
# μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λŠ” λŒ€ν•œλ―Όκ΅­μ— μž…μ ν•œ 이후 2009λ…„κ³Ό 2010년에 두 μ°¨λ‘€μ˜ λΈŒλžœλ“œκ³Όμ˜ μž¬κ²€ν†  및 μƒˆλ‘œμš΄ λ””μžμΈμ„ 톡해 μƒˆλ‘œμš΄ λΈŒλžœλ“œλ‹€. 컀피 μ „λ¬Έμ˜ 프리미엄 이미지λ₯Ό μœ μ§€ν•˜κ³  있고, μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λŠ” ν•œκ΅­μ„ λŒ€ν‘œν•˜λŠ” 프리미엄 컀피 μ „λ¬Έ λΈŒλžœλ“œμ„ λ§Œλ“€κ³  μžˆλ‹€.