metadata
library_name: transformers
license: mit
language:
  - ko
base_model:
  - google/gemma-2-2b-it
pipeline_tag: text-generation

Model Card for Model ID

Gemma2 2b Korean Dialect Translator v0.2.0

Model Description

The Gemma2 2b Korean Dialect Translator is a model developed as part of a project to translate Korean regional dialects into standard Korean and to convert standard Korean into Korean regional dialects.

The model was built by fine-tuning the Gemma2 2b it model with QLoRA.

Uses

์ด ๋ชจ๋ธ์€ ํ•œ๊ตญ์–ด ๋ฐฉ์–ธ์„ ํ‘œ์ค€ ํ•œ๊ตญ์–ด๋กœ ๋ฒˆ์—ญํ•˜๊ฑฐ๋‚˜ ๊ทธ ๋ฐ˜๋Œ€๋กœ ๋ฒˆ์—ญํ•˜๋Š” ๋ฐ ์ง์ ‘ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์Œ์„ฑ ์ธ์‹ ๋ฐ ๋ฒˆ์—ญ ๋„๊ตฌ๋ฅผ ๊ฐœ๋ฐœํ•˜๋Š” ๊ต์œก์ž, ์–ธ์–ดํ•™์ž, ๊ธฐ์ˆ  ๊ฐœ๋ฐœ์ž์—๊ฒŒ ์œ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Examples

Input sentence (dialect): 게난 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건디
Translator output (standard): 그러니까 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데

Input sentence (standard): 그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데
Translator output (dialect): 게난 저 어멍 더 나이 먹어가기 전에 여기 와야 될 건디

Input sentence (dialect): 자이 폴에 독솔 막 난 거 보난 언 생이우다
Translator output (standard): 쟤 팔에 닭살이 많이 난 거 보니까 추운 모양입니다

Input sentence (standard): 재 팔에 닭살이 막 난 거 보니, 추운 모양이다
Translator output (dialect): 재 폴에 독솔 막 난 거 보난 언 생이우다

Bias, Risks, and Limitations

์ด ๋ชจ๋ธ์€ ํ˜„์žฌ์ œ์ฃผ ๋ฐฉ์–ธ์— ์ดˆ์ ์„ ๋งž์ถ˜ ํŠน์ • ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋งž์ถฐ ๋ฏธ์„ธ ์กฐ์ •๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค๋ฅธ ๋ฐฉ์–ธ์ด๋‚˜ ์–ธ์–ด์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์ด ์ œํ•œ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

How to Get Started with the Model

import torch
import transformers

model_id = "sjbaek/gemma2-2b-it-korean-dialect"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

# Text-generation pipeline in half precision; device_map="auto" picks the available device(s).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
    max_new_tokens=512,
)


def dialect_to_standard(text, dialect_type):
    """Build a one-turn chat prompt asking for dialect -> standard Korean."""
    return [
        {
            "role": "user",
            "content": "Convert the following sentence or word which is {}'s dialect to standard Korean:\n\n{}".format(dialect_type, text),
        }
    ]


def standard_to_dialect(text, dialect_type):
    """Build a one-turn chat prompt asking for standard Korean -> dialect."""
    return [
        {
            "role": "user",
            "content": "Convert the following sentence or word which is standard Korean to {}'s dialect :\n\n{}".format(dialect_type, text),
        }
    ]

# Jeju dialect -> standard Korean ("제주도" is Jeju-do)
outputs = pipeline(
    dialect_to_standard("우리 동생도 요번에 월요일날 미깡 타카부댄 내려왔당 못 타난", "제주도"),
    do_sample=True,
    temperature=0.1,
    top_p=0.90,
    add_special_tokens=True,
)

# The last message in the returned conversation is the model's reply.
print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': '우리 동생도 요번에 월요일날 귤 타고 왔다가 못 타니까'}

# Standard Korean -> Jeju dialect
outputs = pipeline(
    standard_to_dialect("그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데", "제주도"),
    do_sample=True,
    temperature=0.1,
    top_p=0.90,
    add_special_tokens=True,
)

print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': '그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데'}
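
The printed results above are whole message dictionaries, while most callers only need the translated text. A small helper along these lines could pull it out, assuming the pipeline returns the chat-style structure shown in those printed results. `extract_reply` and `sample_outputs` are hypothetical names, and `sample_outputs` is a hand-written stand-in so the sketch runs without loading the model.

```python
def extract_reply(outputs):
    """Return the assistant text from a chat-style text-generation result."""
    messages = outputs[0]["generated_text"]
    reply = messages[-1]
    if reply.get("role") != "assistant":
        raise ValueError("last message is not from the assistant")
    return reply["content"].strip()


# Stand-in for a real pipeline result, shaped like the printed output above.
sample_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "(prompt omitted)"},
            {"role": "assistant", "content": "우리 동생도 요번에 월요일날 귤 타고 왔다가 못 타니까"},
        ]
    }
]

print(extract_reply(sample_outputs))
```

With a loaded model, the same call would be `extract_reply(pipeline(dialect_to_standard(text, "제주도")))`.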

Training Data

AI_HUB 중·노년층 한국어 방언 데이터 (Korean dialect data from middle-aged and elderly speakers): Chungcheong-do, Jeolla-do, Jeju-do
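
This card does not describe how dataset rows were turned into training examples. One plausible arrangement, shown purely as an assumption, takes each parallel sentence pair and reuses the prompt wording from the usage functions in this card, giving one chat example per translation direction. `make_training_pairs`, `TO_STANDARD`, and `TO_DIALECT` are hypothetical names; the sample sentences come from the Examples section.

```python
# Assumption: prompt wording copied from this card's usage functions.
# The actual training format is not documented.
TO_STANDARD = "Convert the following sentence or word which is {}'s dialect to standard Korean:\n\n{}"
TO_DIALECT = "Convert the following sentence or word which is standard Korean to {}'s dialect :\n\n{}"


def make_training_pairs(dialect_text, standard_text, dialect_type):
    """Return two chat examples (one per direction) for a parallel sentence pair."""
    return [
        [
            {"role": "user", "content": TO_STANDARD.format(dialect_type, dialect_text)},
            {"role": "assistant", "content": standard_text},
        ],
        [
            {"role": "user", "content": TO_DIALECT.format(dialect_type, standard_text)},
            {"role": "assistant", "content": dialect_text},
        ],
    ]


pairs = make_training_pairs(
    "게난 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건디",
    "그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데",
    "제주도",
)
for example in pairs:
    print(example[0]["content"])
    print("->", example[1]["content"])
```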

TODO

  • ์ถฉ์ฒญ๋„ ๋ฐฉ์–ธ ๋ณ€ํ™˜ ๊ธฐ๋Šฅ (v0.3.0)
  • ์ „๋ผ๋„ ๋ฐฉ์–ธ ๋ณ€ํ™˜ ๊ธฐ๋Šฅ (v0.4.0)
  • ๊ฒฝ์ƒ๋„ ๋ฐฉ์–ธ ๋ณ€ํ™˜ ๊ธฐ๋Šฅ (v0.5.0)
  • ๊ฐ•์›๋„ ๋ฐฉ์–ธ ๋ณ€ํ™˜ ๊ธฐ๋Šฅ (v1.0.0)