Update README.md
Hi! This PR has some optional additions to your model card, based on the format we are using as part of our effort to standardise model cards at Hugging Face. Your additional input regarding KoELECTRA v3 uses (direct and indirect: Misuse, Malicious Use, and Out-of-Scope Use) would be appreciated.
Feel free to merge if you are OK with the changes! (cc @Marissa @Meg)
README.md
CHANGED
---
language: ko
license: apache-2.0
datasets:
- wordpiece
- everyones-corpus
tags:
- korean
---

# KoELECTRA v3 (Base Discriminator)

## Table of Contents
1. [Model Details](#model-details)
2. [How to Get Started with the Model](#how-to-get-started-with-the-model)
3. [Uses](#uses)
4. [Limitations and Bias](#limitations-and-bias)
5. [Training](#training)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)

## Model Details
* **Model Description:**
KoELECTRA v3 (Base Discriminator) is a pretrained ELECTRA language model for Korean (`koelectra-base-v3-discriminator`). [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) is trained with Replaced Token Detection: a small generator replaces some input tokens, and the discriminator learns to predict, for every token, whether it is an original ("real") token or a replaced ("fake") one. Because this objective learns from all input tokens, it yields results competitive with other pretrained language models such as BERT.
* **Developed by:** Jangwon Park
* **Model type:**
* **Language(s):** Korean
* **License:** Apache 2.0
* **Related Models:**
* **Resources for more information:** For more detail, please see the [original repository](https://github.com/monologg/KoELECTRA/blob/master/README_EN.md).

## How to Get Started with the Model

### Load model and tokenizer

```
...
print(list(zip(fake_tokens, predictions.tolist()[1:-1])))
```
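
A minimal, self-contained sketch of this discriminator usage follows; the Hub model ID `monologg/koelectra-base-v3-discriminator` and the Korean example sentence are assumptions for illustration rather than a reproduction of the full example above.

```python
# Illustrative sketch: load the v3 discriminator and flag replaced ("fake") tokens.
# The model ID and the Korean example sentence below are assumptions for illustration.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

model_id = "monologg/koelectra-base-v3-discriminator"
discriminator = ElectraForPreTraining.from_pretrained(model_id)
tokenizer = ElectraTokenizer.from_pretrained(model_id)

# A sentence in which one token has been deliberately replaced.
fake_sentence = "나는 내일 밥을 먹었다."

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

discriminator_outputs = discriminator(fake_inputs)
# Positive logits mean the discriminator predicts "replaced"; map them to 1/0.
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

# Index the batch dimension, then drop the [CLS]/[SEP] positions to align with the tokens.
print(list(zip(fake_tokens, predictions[0].tolist()[1:-1])))
```

A token paired with `1.0` is one the discriminator believes was replaced.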

## Uses

#### Direct Use

#### Misuse, Malicious Use, and Out-of-Scope Use

## Limitations and Bias

#### Limitations

#### Bias

## Training

KoELECTRA is trained on 34 GB of Korean text. It uses a [WordPiece](https://github.com/monologg/KoELECTRA/blob/master/docs/wordpiece_vocab_EN.md) vocabulary, and the model is uploaded to S3.

### Training Data

* **Layers:** 12
* **Embedding Size:** 768
* **Hidden Size:** 768
* **Number of heads:** 12

A "WordPiece" vocabulary was used (see the tokenizer sketch below):

| | Vocab Length | Do Lower Case |
|:-:|:------------:|:-------------:|
| v3 | 35,000 | False |

For v3, a 20 GB corpus from Everyone's Corpus (newspaper, written, spoken, messenger, and web text) was additionally used.
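
A quick way to check the vocabulary properties in the table above; the Hub model ID here is an assumption for illustration.

```python
# Sketch: inspect the WordPiece tokenizer shipped with the v3 checkpoints.
# The Hub model ID below is assumed for illustration.
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

print(tokenizer.vocab_size)     # expected: 35000 for v3
print(tokenizer.do_lower_case)  # expected: False (Korean text is not lowercased)
print(tokenizer.tokenize("한국어 ELECTRA를 공부하고 있어요."))  # WordPiece subword split
```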

### Training Procedure

#### Pretraining

* **Batch Size:** 256
* **Training Steps:** 1.5M
* **LR:** 2e-4
* **Max Sequence Length:** 512
* **Training Time:** 14 days

## Evaluation

#### Results
The model developer discusses the fine-tuning results for v3 in comparison to other base models such as XLM-RoBERTa-Base [in the original repository](https://github.com/monologg/KoELECTRA/blob/master/finetune/README_EN.md).

These results come from running the provided configuration as-is; additional hyperparameter tuning may yield better performance.

* **Size:** 421M
* **NSMC (acc):** 90.63
* **Naver NER (F1):** 88.11
* **PAWS (acc):** 84.45
* **KorNLI (acc):** 82.24
* **KorSTS (Spearman):** 85.53
* **Question Pair (acc):** 95.25
* **KorQuAD (Dev) (EM/F1):** 84.83/93.45
* **Korean-Hate-Speech (Dev) (F1):** 67.61
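
These benchmarks are standard Korean fine-tuning tasks. The following is only an illustrative sketch of such a fine-tuning run, not the developer's configuration; the dataset ID `nsmc` and every hyperparameter are assumptions.

```python
# Illustrative fine-tuning sketch for NSMC sentiment classification (one of the
# benchmarks above). NOT the developer's finetune setup: the dataset ID "nsmc"
# and all hyperparameters here are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    ElectraForSequenceClassification,
    ElectraTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "monologg/koelectra-base-v3-discriminator"
tokenizer = ElectraTokenizer.from_pretrained(model_id)
model = ElectraForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("nsmc")  # assumed Hub dataset; text field "document", label field "label"

def tokenize(batch):
    return tokenizer(batch["document"], truncation=True, max_length=128, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="koelectra-base-v3-nsmc",  # hypothetical output path
        per_device_train_batch_size=32,
        num_train_epochs=3,
        learning_rate=5e-5,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```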

## Environmental Impact

### KoELECTRA v3 (Base Discriminator) Estimated Emissions

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

* **Hardware Type:** TPU v3-8
* **Hours used:** 336 hours (14 days)
* **Cloud Provider:** GCP (Google Cloud Platform)
* **Compute Region:** europe-west4-a
* **Carbon Emitted (power consumption x time x carbon intensity of the power grid):** 54.2 kg of CO2eq
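
The figure above follows the calculator's simple formula. The sketch below only illustrates that arithmetic; the power draw and grid carbon intensity are placeholder assumptions, not the actual inputs behind the 54.2 kg estimate.

```python
# Illustration of the formula: power consumption x time x grid carbon intensity.
# The power draw and carbon intensity below are placeholder assumptions, not the
# actual inputs behind the 54.2 kg CO2eq figure reported above.
hours_used = 336            # 14 days on a TPU v3-8, from the card
power_draw_kw = 0.283       # assumed average board power draw in kW (placeholder)
grid_kgco2_per_kwh = 0.57   # assumed carbon intensity for the compute region (placeholder)

energy_kwh = power_draw_kw * hours_used
emissions_kg = energy_kwh * grid_kgco2_per_kwh
print(f"~{emissions_kg:.1f} kg CO2eq")
```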

## Citation

```bibtex
@misc{park2020koelectra,
  author       = {Park, Jangwon},
  title        = {KoELECTRA: Pretrained ELECTRA Model for Korean},
  year         = {2020},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}
```