---
language:
- ko
pipeline_tag: text2text-generation
license: apache-2.0
---


## Korean Typos Corrector

- A spelling corrector for colloquial Korean, fine-tuned from the ETRI ET5 model.

## Base PLM Model (ET5)

- ETRI (https://aiopen.etri.re.kr/et5Model)

## Base Dataset

- Spelling-correction data from the National Institute of Korean Language's Modu Corpus (https://corpus.korean.go.kr/request/reausetMain.do?lang=ko)

## Data Preprocessing

1. Remove special characters: commas (,) and periods (.)
2. Remove null values ("")
3. Remove sentences that are too short (length 2 or less)
4. Remove words containing name tags such as &name&, name1 (only the word is removed; the sentence is kept)

- Total: 318,882 pairs
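The filtering steps above can be sketched roughly as follows. This is a minimal illustration only, not the actual preprocessing script; the `preprocess` function name and the exact name-tag pattern are assumptions.

```python
import re

# Hypothetical pattern for name tags such as &name& or name1 (assumption)
NAME_TAG = re.compile(r"\S*(?:&name\d*&|name\d+)\S*")

def preprocess(pairs):
    """Clean and filter (noisy, corrected) sentence pairs as described above."""
    cleaned = []
    for noisy, corrected in pairs:
        # 1. Remove commas and periods
        noisy = noisy.replace(",", "").replace(".", "")
        corrected = corrected.replace(",", "").replace(".", "")
        # 4. Drop name-tagged words but keep the rest of the sentence
        noisy = NAME_TAG.sub("", noisy).strip()
        corrected = NAME_TAG.sub("", corrected).strip()
        # 2./3. Skip null values and sentences of length 2 or less
        if len(noisy) <= 2 or len(corrected) <= 2:
            continue
        cleaned.append((noisy, corrected))
    return cleaned
```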


***


## How to use

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5 model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("j5ng/et5-typos-corrector")
tokenizer = T5Tokenizer.from_pretrained("j5ng/et5-typos-corrector")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# device = "mps:0" if torch.backends.mps.is_available() else "cpu"  # for Mac M1

model = model.to(device)

# Example input sentence
input_text = "아늬 진짜 무ㅓ하냐고"

# Encode the input with the task prefix ("맞춤법을 고쳐주세요: " = "Please correct the spelling: ")
input_encoding = tokenizer("맞춤법을 고쳐주세요: " + input_text, return_tensors="pt")

input_ids = input_encoding.input_ids.to(device)
attention_mask = input_encoding.attention_mask.to(device)

# Generate the corrected sentence
output_encoding = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=128,
    num_beams=5,
    early_stopping=True,
)

# Decode the output
output_text = tokenizer.decode(output_encoding[0], skip_special_tokens=True)

# Print the result
print(output_text)  # 아니 진짜 뭐 하냐고.
```


***


## With Transformers Pipeline

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

model = T5ForConditionalGeneration.from_pretrained('j5ng/et5-typos-corrector')
tokenizer = T5Tokenizer.from_pretrained('j5ng/et5-typos-corrector')

typos_corrector = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    framework="pt",
)

input_text = "완죤 어이업ㅅ네진쨬ㅋㅋㅋ"

# Prepend the task prefix ("맞춤법을 고쳐주세요: " = "Please correct the spelling: ")
output_text = typos_corrector("맞춤법을 고쳐주세요: " + input_text,
                              max_length=128,
                              num_beams=5,
                              early_stopping=True)[0]['generated_text']

print(output_text)  # 완전 어이없네 진짜 ㅋㅋㅋ.
```