metadata
license: mit
language:
  - ko
pipeline_tag: text-generation
tags:
  - Language
  - Dialect

JEJUMA-001

Protecting Our Disappearing Dialects with LLMs, Project 1: The Jeju Dialect

Why did we start?

A rapidly disappearing regional dialect: Jeju

  • Many regional dialects, and the Jeju dialect in particular, are disappearing quickly.
  • UNESCO has classified Jeju (the Jeju dialect) as a "critically endangered" language.
  • Only 36.1% of Jeju residents know the Jeju language.
  • In particular, as exchange with other regions has grown, younger people tend to prefer standard Korean over Jeju.

Language models are weak at regional dialects

  • Online sources are written largely in standard Korean, so language models know little about regional dialects, for which data is scarce.
  • Jeju in particular differs so much from standard Korean that models cannot understand it beyond a few well-known words and sentences.

How did we address this?

  • A language model converts hard-to-understand Jeju into standard Korean, so that Jeju is not forgotten.
  • A language model generates the Jeju version of a standard Korean sentence, so that it can be looked up.
  • A language model was chosen so that the broad knowledge it has already learned carries over unchanged.

About the model

  • Llama3.1 is fine-tuned on Jeju dialect data in a multi-task setup, so that it can perform several tasks related to the Jeju dialect.
  • JEJUMA-001 can currently convert between the dialect and standard Korean and detect the dialect, among other tasks.
  • To train JEJUMA-001, about 1.05 million pairs of Jeju dialect and Seoul (standard) speech were collected, and the 170,000 pairs in which Jeju features are most evident were selected.
  • From these, training data covering four tasks was generated, about 340,000 examples in total.
  • The model was trained with LoRA via LlamaFactory, for one epoch over all of the data.
  • On difficult Jeju sentences, it shows higher translation accuracy than GPT-4o and the Korean models Upstage Solar and Naver HCX.
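
The README does not describe the training record format, so the following is only a sketch of how one dialect/standard pair could be expanded into examples for the four tasks. The prompt prefixes are copied from the usage code further down in this README, and the sample pair is the reference pair from the comparison examples. The Alpaca-style keys (`instruction`, `input`, `output`), the `build_examples` helper, and especially the answer strings for the two detection tasks are assumptions made for illustration, not the authors' actual format.

```python
# Prompt prefixes copied from JejuPromptTemplate in this README.
PROMPTS = {
    "dialect_to_standard": "Convert the following sentence or word which is Jeju island dialect to standard Korean: ",
    "standard_to_dialect": "Convert the following sentence or word which is standard Korean to Jeju island dialect: ",
    "detect_dialect": "Detect the following sentence or word is Jeju island dialect or standard Korean: ",
    "detect_dialect_and_convert": "Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: ",
}

# Hypothetical answer label: the real label wording is not given in the README.
DIALECT_LABEL = "Jeju island dialect"


def build_examples(dialect: str, standard: str) -> list[dict[str, str]]:
    """Expand one (Jeju dialect, standard Korean) pair into four task examples."""
    targets = {
        "dialect_to_standard": (dialect, standard),
        "standard_to_dialect": (standard, dialect),
        # Assumed answer formats for the detection tasks (illustrative only).
        "detect_dialect": (dialect, DIALECT_LABEL),
        "detect_dialect_and_convert": (dialect, f"{DIALECT_LABEL}: {standard}"),
    }
    return [
        {"instruction": PROMPTS[task] + source, "input": "", "output": answer}
        for task, (source, answer) in targets.items()
    ]


examples = build_examples(
    "미깡낭 경 가심 너네 아방 좀 데령",
    "귤나무에 그냥 가서 너네 아버지좀 찾아와라.",
)
for example in examples:
    print(example["output"])
```

This sketch emits all four tasks for every pair. The README reports about 340,000 examples from 170,000 pairs, so the actual data was evidently distributed across tasks differently; how is not stated.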

Jeju -> Standard Korean

Input sentence: 자이 폴에 독솔 막 난 거 보난 언 생이우다
Reference answer: 재 팔에 닭살이 막 난 거 보니, 추운 모양이다. (Roughly: "Seeing the goosebumps all over that kid's arm, they must be cold.")
Upstage Solar output: 그 바위에 뱀이 나타나는 걸 보니까 정말 놀랐다. (Roughly: "Seeing a snake appear on that rock, I was really startled.")
Naver HCX output: 재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다. (Roughly: "Seeing poisonous plants sprouting all over the grass, it is a young pine tree.")
GPT-4o output: 저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랐다. (Roughly: "Seeing a viper suddenly appear on that rock over there, I was really startled.")
JEJUMA-001 output:

Standard Korean -> Jeju

Input sentence: 귤나무에 그냥 가서 너네 아버지좀 찾아와라. (Roughly: "Just go over to the tangerine tree and bring your father back.")
Reference answer: 미깡낭 경 가심 너네 아방 좀 데령
Upstage Solar output: 귤 나무에 가서 네 아버지를 좀 찾아와.
Naver HCX output: 귤낭에 강 느네 아방 좀 데령오라.
GPT-4o output: 귤나무에 걍 가서 햄신 아방 좀 찾아와라.
JEJUMA-001 output: 미깡낭에 그냥 강 너네 아방좀 촞아오라

How do I use it?

  • Use one of the four prompts defined in the template: dialect_to_standard, standard_to_dialect, detect_dialect, or detect_dialect_and_convert.
  • dialect_to_standard: converts Jeju to standard Korean
  • standard_to_dialect: converts standard Korean to Jeju
  • detect_dialect: detects whether the input is Jeju or standard Korean
  • detect_dialect_and_convert: automatically detects whether the input is Jeju or standard Korean, then converts it to the other
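import transformers
import torch

model_id = "JEJUMA/JEJUMA-001"

# Load the model as a text-generation pipeline in bfloat16, with automatic device placement.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Stop generating at either the tokenizer's EOS token or the Llama 3 end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]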

class JejuPromptTemplate:
    @staticmethod
    def dialect_to_standard(text):
        return [{"role":"user", "content":"Convert the following sentence or word which is Jeju island dialect to standard Korean: " + text},]

    @staticmethod
    def standard_to_dialect(text):
        return [{"role":"user", "content":"Convert the following sentence or word which is standard Korean to Jeju island dialect: " + text},]

    @staticmethod
    def detect_dialect(text):
        return [{"role":"user", "content":"Detect the following sentence or word is Jeju island dialect or standard Korean: " + text},]

    @staticmethod
    def detect_dialect_and_convert(text):
        return [{"role":"user", "content":"Detect the following sentence or word is Jeju island dialect or standard Korean and convert the following sentence or word to Jeju island dialect or standard Korean: " + text},]


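# The example sentence is Jeju dialect, so it goes through the dialect_to_standard prompt.
outputs = pipeline(
    JejuPromptTemplate.dialect_to_standard("자이 폴에 독솔 막 난 거 보난 언 생이우다"),
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
)

# The last message in the returned conversation is the model's reply.
print(outputs[0]["generated_text"][-1])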
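
With chat-style input, the pipeline returns the whole conversation, and the final `print` above shows the last message as a dict with `role` and `content` keys. A small helper can pull out just the reply text. Here `extract_reply` is a hypothetical name, and `fake_outputs` is a hand-written stand-in for a pipeline result (the reply text is a placeholder, not real model output), so the snippet runs without downloading the model.

```python
from typing import Any


def extract_reply(outputs: list[dict[str, Any]]) -> str:
    """Return the assistant's text from a chat-style text-generation result."""
    messages = outputs[0]["generated_text"]
    last = messages[-1]
    if last.get("role") != "assistant":
        raise ValueError("last message is not an assistant reply")
    return last["content"].strip()


# Stand-in with the same shape as the pipeline's return value for chat input.
fake_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "<prompt built by JejuPromptTemplate>"},
            {"role": "assistant", "content": " <placeholder reply> "},
        ]
    }
]

print(extract_reply(fake_outputs))
```

With the real pipeline, `extract_reply(outputs)` would replace the final `print` call in the usage code.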

Future plans

  • JEJUMA-002 will be trained on data from all of Korea's dialects so that it can perform the same tasks for each of them.
  • JEJUMA-003 will be trained on data from all of Korea's dialects together with tasks that explain them, with the aim of partially translating varieties absent from the data (the Yanbian dialect, North Korean, or a third language).
  • JEJUMA-003 is the final stage of this research, which is why an 8B language model is used rather than a translation model or a smaller model.