---
language: ko
---
# Pretrained BART in Korean
This is a BART model pretrained on multiple Korean datasets.
I used multiple datasets so that the model generalizes to both colloquial and written text.
Training was supported by the TPU Research Cloud program.
The script used to pretrain the model is here.
When you use the inference API, you must wrap the sentence with `[BOS]` and `[EOS]`, as in the example below.

[BOS] 안녕하세요? 반가워요~~ [EOS]

You can also test mask-filling performance using the `[MASK]` token, like this:

[BOS] [MASK] 먹었어? [EOS]
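For programmatic use, the wrapping step above can be sketched as a small helper. This is a minimal illustration; the function name is hypothetical and not part of the model's API — the model only requires that the input string begins with `[BOS]` and ends with `[EOS]`.

```python
def wrap_for_inference(text: str) -> str:
    """Wrap a raw sentence with the [BOS]/[EOS] special tokens.

    Helper name is illustrative; the model expects the input to
    start with [BOS] and end with [EOS].
    """
    return f"[BOS] {text} [EOS]"


# Plain generation input
print(wrap_for_inference("안녕하세요? 반가워요~~"))  # [BOS] 안녕하세요? 반가워요~~ [EOS]

# Mask-filling input
print(wrap_for_inference("[MASK] 먹었어?"))  # [BOS] [MASK] 먹었어? [EOS]
```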
## Benchmark
| Dataset | KLUE NLI dev | NSMC test | QuestionPair test | KLUE TC dev | | KLUE STS dev | | | KorSTS dev | | | HateSpeech dev | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | Acc | Acc | Acc | Acc | F1 | F1 | Pearson | Spearman | F1 | Pearson | Spearman | Bias Acc | Hate Acc |
| Score | 0.5253 | 0.8425 | 0.8945 | 0.8047 | 0.7988 | 0.7411 | 0.7471 | 0.7399 | 0.7725 | 0.6503 | 0.6191 | 0.7537 | 0.5605 |
- Performance was measured on Colab using the notebooks here.
## Used Datasets
모두의 말뭉치 (Modu Corpus)
- 일상 대화 말뭉치 2020 (everyday conversation corpus 2020)
- 구어 말뭉치 (spoken corpus)
- 문어 말뭉치 (written corpus)
- 신문 말뭉치 (newspaper corpus)
AIHub
- 개방데이터 전문분야말뭉치 (specialized-domain corpus)
- 개방데이터 한국어대화요약 (Korean dialogue summarization)
- 개방데이터 감성 대화 말뭉치 (emotional dialogue corpus)
- 개방데이터 한국어 음성 (Korean speech)
- 개방데이터 한국어 SNS (Korean SNS)