---
library_name: transformers
license: mit
language:
- ko
base_model:
- google/gemma-2-2b-it
pipeline_tag: text-generation
---

# Model Card for gemma2-2b-it-korean-dialect

Gemma2 2b Korean Dialect Interpreter v0.2.0

## Model Description

The Gemma2 2b Korean Dialect Interpreter is a model developed as part of a project to translate Korean dialects into standard Korean and to convert standard Korean into Korean dialects.

The model was built by fine-tuning the Gemma2 2b it model with QLoRA.

## Uses

This model can be used directly to translate Korean dialect into standard Korean, or the reverse. It may be useful to educators, linguists, and developers who build speech recognition and translation tools.

### Examples

| Input sentence             | Dialect: 게난 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건디 | Standard: 그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데 |
|:--------------------------|:----------------------------------------------------------:|:------------------------------------------:|
| Dialect interpreter output | Standard: 그러니까 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데 | Dialect: 게난 저 어멍 더 나이 먹어가기 전에 여기 와야 될 건디 |


| Input sentence             | Dialect: 자이 폴에 독솔 막 난 거 보난 언 생이우다 | Standard: 재 팔에 닭살이 막 난 거 보니, 추운 모양이다 |
|:--------------------------|:----------------------------------------------------------:|:-------------------------------------:|
| Dialect interpreter output | Standard: 쟤 팔에 닭살이 많이 난 거 보니까 추운 모양입니다 | Dialect: 재 폴에 독솔 막 난 거 보난 언 생이우다 |
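The outputs in the tables above differ from the reference sentences by only a few characters (for example 그러니까 versus 그러니깐). One rough way to quantify that closeness is a character-level similarity ratio. The sketch below uses Python's standard `difflib`; the helper name `char_similarity` is hypothetical and is not part of the model or of any evaluation the authors report.

```python
from difflib import SequenceMatcher


def char_similarity(reference: str, prediction: str) -> float:
    """Return a 0..1 character-level similarity ratio (1.0 means identical)."""
    return SequenceMatcher(None, reference, prediction).ratio()


# Reference and model output taken from the first example table above.
reference = "그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데"
prediction = "그러니까 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데"

score = char_similarity(reference, prediction)
print(f"{score:.3f}")
```

A ratio like this only measures surface overlap, so it says nothing about whether a differing word is an acceptable alternative in the target variety.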




## Bias, Risks, and Limitations

The model is currently fine-tuned on a specific dataset focused on the Jeju dialect, so its performance on other dialects or languages may be limited.
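Since only the Jeju dialect is covered in this version, a caller may want to reject other dialect names before building a prompt. A minimal sketch, assuming a hypothetical `SUPPORTED_DIALECTS` set and `check_dialect` helper that are not part of the model or of the `transformers` API:

```python
# Hypothetical allow-list: this card only documents Jeju ("제주도") support in v0.2.0.
SUPPORTED_DIALECTS = {"제주도"}


def check_dialect(dialect_type: str) -> str:
    """Return dialect_type unchanged if it is supported, otherwise raise ValueError."""
    if dialect_type not in SUPPORTED_DIALECTS:
        raise ValueError(
            f"Unsupported dialect: {dialect_type!r}; supported: {sorted(SUPPORTED_DIALECTS)}"
        )
    return dialect_type


print(check_dialect("제주도"))
```

The set can be extended as further dialects are released.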

## How to Get Started with the Model

```python
import transformers
import torch

# Load the fine-tuned model's tokenizer from the Hugging Face Hub.
model_id = "sjbaek/gemma2-2b-it-korean-dialect"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

# Build a chat-capable text-generation pipeline in half precision.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
    max_new_tokens=512,
)


def dialect_to_standard(text, dialect_type):
    """Build a chat prompt asking for dialect -> standard Korean."""
    return [
        {
            "role": "user",
            "content": "Convert the following sentence or word which is {}'s dialect to standard Korean:\n\n{}".format(dialect_type, text)
        }
    ]


def standard_to_dialect(text, dialect_type):
    """Build a chat prompt asking for standard Korean -> dialect."""
    return [
        {
            "role": "user",
            "content": "Convert the following sentence or word which is standard Korean to {}'s dialect :\n\n{}".format(dialect_type, text)
        }
    ]

# Jeju dialect -> standard Korean
outputs = pipeline(
    dialect_to_standard("우리 동생도 요번에 월요일날 미깡 타카부댄 내려왔당 못 타난", "제주도"),
    do_sample=True,
    temperature=0.1,
    top_p=0.90,
    add_special_tokens=True
)

print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': '우리 동생도 요번에 월요일날 귤 타고 왔다가 못 타니까'}

# Standard Korean -> Jeju dialect
outputs = pipeline(
    standard_to_dialect("그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데", "제주도"),
    do_sample=True,
    temperature=0.1,
    top_p=0.90,
    add_special_tokens=True
)

print(outputs[0]["generated_text"][-1])
# {'role': 'assistant', 'content': '그러니깐 저 어머니 더 나이 먹어가기 전에 여기 와야 될 건데'}
```
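The `print` calls above show the whole assistant message dict. To get only the translated string, one can read the `content` field of the last message. The sketch below does this on a hand-written stand-in for the pipeline output, so it runs without downloading the model. `extract_reply` is a hypothetical helper name, and the nested structure is assumed from the printed output shown above.

```python
def extract_reply(outputs: list) -> str:
    """Return the assistant's text from a chat-style text-generation result."""
    last_message = outputs[0]["generated_text"][-1]
    if last_message.get("role") != "assistant":
        raise ValueError("last message is not from the assistant")
    return last_message["content"].strip()


# Stand-in mimicking the structure printed in the example above (not a real model call).
fake_outputs = [{
    "generated_text": [
        {"role": "user", "content": "Convert the following sentence ..."},
        {"role": "assistant", "content": "우리 동생도 요번에 월요일날 귤 타고 왔다가 못 타니까"},
    ]
}]

print(extract_reply(fake_outputs))
```

With the real pipeline, the same helper could be called as `extract_reply(outputs)` after either of the calls above.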

### Training Data

[AI_HUB Korean dialect data from middle-aged and elderly speakers (Chungcheong-do, Jeolla-do, Jeju-do)](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=71558)

## TODO

- Chungcheong-do dialect conversion (v0.3.0)
- Jeolla-do dialect conversion (v0.4.0)
- Gyeongsang-do dialect conversion (v0.5.0)
- Gangwon-do dialect conversion (v1.0.0)