---
license: apache-2.0
datasets:
- heegyu/kowikitext
- heegyu/kowiki-sentences
language:
- ko
- en
library_name: transformers
tags:
- pytorch
---
Experimental Repository :)
Contents will be updated without any notice. If you plan to use this repository, please pin a specific revision with a git hash.
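
A minimal way to pin a revision with transformers (the commit hash below is a placeholder, not a real revision of this repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pinning an exact commit keeps later pushes to this repo from silently
# changing your results. Replace the placeholder with a real commit hash.
REVISION = '0123abc'  # placeholder, not a real revision

tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev', revision=REVISION)
model = AutoModelForCausalLM.from_pretrained('beomi/Mistral-Ko-Inst-dev', revision=REVISION)
```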
This experiment aims to:
- Maintain the NLU capability of the Mistral-Instruct model (mistralai/Mistral-7B-Instruct-v0.1)
- Adapt a new Korean vocabulary seamlessly (a rough sketch of this step follows this list)
- Use a minimal dataset (Korean Wikipedia only)
- Use a computationally efficient method
- Let the model answer using its English knowledge and NLU capability even when the question and answer are in Korean only
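
The training recipe itself is not spelled out here, but the vocabulary-adaptation step can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the actual procedure used for this checkpoint: it assumes new Korean subwords are simply appended to the stock Mistral tokenizer and the embeddings are resized to match, and the token list shown is hypothetical.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = 'mistralai/Mistral-7B-Instruct-v0.1'
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical examples; a real run would append thousands of Korean
# subwords learned from a Korean Wikipedia corpus.
new_tokens = ['안녕하세요', '커피', '브랜드']
tokenizer.add_tokens(new_tokens)

# Resize so the new token ids get embedding rows; existing rows keep their
# weights, which helps preserve the base model's English NLU capability.
model.resize_token_embeddings(len(tokenizer))

# A compute-efficient option is to train only the new embedding rows on the
# Korean corpus instead of updating all 7B parameters.
```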
Here's a quick test:
```python
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    torch_dtype='auto',
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=350,
    return_full_text=False,
    no_repeat_ngram_size=6,
    eos_token_id=1,  # not yet tuned to generate </s>; stop on <s> instead.
)

def gen(x):
    chat = tokenizer.apply_chat_template([
        {"role": "user", "content": x},
        # {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        # {"role": "user", "content": "Do you have mayonnaise recipes? please say in Korean."}
    ], tokenize=False)
    print(pipe(chat)[0]['generated_text'].strip())
gen("μ€νλ²
μ€μ μ€νλ²
μ€ μ½λ¦¬μμ μ°¨μ΄λ?")
# (Generation example)
# 스타벅스는 전세계적으로 운영하고 있는 커피 전문사이다. 한국에는 스타벅스 코리아라는 이름으로 운영되고 있다.
# 스타벅스 코리아는 대한민국에 입점한 이후 2009년과 2010년에 두 차례의 브랜드과의 재검토 및 새로운 디자인을 통해 새로운 브랜드다. 커피 전문점 프리미엄 이미지를 유지하고 있고, 스타벅스 코리아는 한국을 대표하는 프리미엄 커피 전문 브랜드을 만들고 있다.
#
# (English gloss: "Starbucks is a coffee company operating worldwide. In Korea it
# operates under the name Starbucks Korea. After entering South Korea, Starbucks
# Korea became a new brand through brand reviews and new designs in 2009 and 2010.
# It keeps the premium coffee-shop image, and Starbucks Korea is building a premium
# coffee brand that represents Korea.")
```