Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/indolem/indobert-base-uncased/README.md
README.md
ADDED
@@ -0,0 +1,56 @@
---
language: id
tags:
- indobert
- indolem
license: mit
inference: false
datasets:
- 220M words (IndoWiki, IndoWC, News)
---

## About

[IndoBERT](https://arxiv.org/pdf/2011.00677.pdf) is the Indonesian version of the BERT model. We trained the model on over 220M words, aggregated from three main sources:
* Indonesian Wikipedia (74M words)
* news articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55M words in total)
* an Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words).

We trained the model for 2.4M steps (180 epochs); the final perplexity on the development set was <b>3.97</b> (similar to English BERT-base).

This <b>IndoBERT</b> was used to examine IndoLEM, an Indonesian benchmark that comprises seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.

| Task | Metric | Bi-LSTM | mBERT | MalayBERT | IndoBERT |
| ---- | ---- | ---- | ---- | ---- | ---- |
| POS Tagging | Acc | 95.4 | <b>96.8</b> | <b>96.8</b> | <b>96.8</b> |
| NER UGM | F1 | 70.9 | 71.6 | 73.2 | <b>74.9</b> |
| NER UI | F1 | 82.2 | 82.2 | 87.4 | <b>90.1</b> |
| Dep. Parsing (UD-Indo-GSD) | UAS/LAS | 85.25/80.35 | 86.85/81.78 | 86.99/81.87 | <b>87.12</b>/<b>82.32</b> |
| Dep. Parsing (UD-Indo-PUD) | UAS/LAS | 84.04/79.01 | <b>90.58</b>/<b>85.44</b> | 88.91/83.56 | 89.23/83.95 |
| Sentiment Analysis | F1 | 71.62 | 76.58 | 82.02 | <b>84.13</b> |
| Summarization | R1/R2/RL | 67.96/61.65/67.24 | 68.40/61.66/67.67 | 68.44/61.38/67.71 | <b>69.93</b>/<b>62.86</b>/<b>69.21</b> |
| Next Tweet Prediction | Acc | 73.6 | 92.4 | 93.1 | <b>93.7</b> |
| Tweet Ordering | Spearman corr. | 0.45 | 0.53 | 0.51 | <b>0.59</b> |

The paper was published at the 28th International Conference on Computational Linguistics (COLING 2020). Please refer to https://indolem.github.io for more details about the benchmarks.
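
To make the comparison in the results table easier to read, the following sketch computes how much IndoBERT gains over a chosen baseline on the single-metric tasks. The scores are copied from the table; the names `RESULTS`, `MODELS`, and `gain_over` are illustrative and not part of any IndoLEM tooling. The dependency parsing and summarization rows are left out because they report several metrics per cell.

```python
# Single-metric rows copied from the results table.
# Column order: Bi-LSTM, mBERT, MalayBERT, IndoBERT.
MODELS = ("Bi-LSTM", "mBERT", "MalayBERT", "IndoBERT")

RESULTS = {
    "POS Tagging (Acc)": (95.4, 96.8, 96.8, 96.8),
    "NER UGM (F1)": (70.9, 71.6, 73.2, 74.9),
    "NER UI (F1)": (82.2, 82.2, 87.4, 90.1),
    "Sentiment Analysis (F1)": (71.62, 76.58, 82.02, 84.13),
    "Next Tweet Prediction (Acc)": (73.6, 92.4, 93.1, 93.7),
    "Tweet Ordering (Spearman corr.)": (0.45, 0.53, 0.51, 0.59),
}


def gain_over(baseline, results=RESULTS):
    """Return IndoBERT's score minus the baseline's score for each task."""
    base_idx = MODELS.index(baseline)
    indo_idx = MODELS.index("IndoBERT")
    return {
        task: round(scores[indo_idx] - scores[base_idx], 2)
        for task, scores in results.items()
    }


# Print the per-task difference between IndoBERT and mBERT.
for task, delta in gain_over("mBERT").items():
    print(f"{task}: {delta:+.2f}")
```

Note that the differences are in each task's own metric, so the Spearman correlation row is not on the same scale as the accuracy and F1 rows.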

## How to use

### Load model and tokenizer (tested with transformers==3.5.1)
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")
```
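
Once the model and tokenizer are loaded, a common next step is to obtain contextual token representations. The following is a minimal sketch of that step, not something from the original model card: the helper name `embed_sentence` is hypothetical, and it assumes PyTorch is installed alongside `transformers`.

```python
MODEL_ID = "indolem/indobert-base-uncased"


def embed_sentence(text, model_id=MODEL_ID):
    """Return last-layer hidden states for `text`, shape (1, seq_len, hidden_size)."""
    # Imported inside the function so the heavy dependencies are only
    # required when the helper is actually called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout so repeated calls give the same output

    # Tokenize into PyTorch tensors (input_ids, attention_mask, ...).
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)

    # Index 0 is the last hidden state, whether the model returns
    # a plain tuple or a ModelOutput object.
    return outputs[0]
```

Calling `embed_sentence("Saya suka membaca buku.")` (an arbitrary example sentence) would download the checkpoint on first use and return one vector per token. The helper reloads the model on every call to stay self-contained; in practice you would load it once and reuse it.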

## Citation
If you use our work, please cite:

```bibtex
@inproceedings{koto2020indolem,
  title={IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP},
  author={Fajri Koto and Afshin Rahimi and Jey Han Lau and Timothy Baldwin},
  booktitle={Proceedings of the 28th COLING},
  year={2020}
}
```