---
language:
- ja
license: mit
tags:
- generated_from_trainer
- ner
- bert
metrics:
- f1
widget:
- text: 鈴井は4月の陽気の良い日に、鈴をつけて北海道のトムラウシへと登った
- text: 中国では、中国共産党による一党統治が続く
base_model: xlm-roberta-base
model-index:
- name: xlm-roberta-ner-ja
  results: []
---

# xlm-roberta-ner-japanese

(Japanese caption: 日本語の固有表現抽出のモデル)

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) (a pre-trained cross-lingual `RobertaModel`) for named entity recognition (NER) token classification.

The model is fine-tuned on an NER dataset provided by Stockmark Inc., in which the data was collected from Japanese Wikipedia articles.<br>
See [here](https://github.com/stockmarkteam/ner-wikipedia-dataset) for the license of this dataset.

Each token is labeled with one of the following tags:

| Label id | Tag | Tag in Widget | Description |
|---|---|---|---|
| 0 | O | (None) | others (not a named entity) |
| 1 | PER | PER | person |
| 2 | ORG | ORG | general corporation / organization |
| 3 | ORG-P | P | political organization |
| 4 | ORG-O | O | other organization |
| 5 | LOC | LOC | location |
| 6 | INS | INS | institution, facility |
| 7 | PRD | PRD | product |
| 8 | EVT | EVT | event |

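The same mapping is stored in the model configuration, so you can check it programmatically. A minimal sketch, assuming the checkpoint exposes the tags through the standard `id2label` convention:

```python
from transformers import AutoConfig

# Load only the configuration and inspect the label mapping
config = AutoConfig.from_pretrained("tsmatz/xlm-roberta-ner-japanese")
print(config.id2label)
# Expected to match the table above, e.g.
# {0: 'O', 1: 'PER', 2: 'ORG', 3: 'ORG-P', 4: 'ORG-O', 5: 'LOC', 6: 'INS', 7: 'PRD', 8: 'EVT'}
```
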
## Intended uses

```python
from transformers import pipeline

model_name = "tsmatz/xlm-roberta-ner-japanese"

# Build a token-classification (NER) pipeline and run it on a Japanese sentence
classifier = pipeline("token-classification", model=model_name)
result = classifier("鈴井は4月の陽気の良い日に、鈴をつけて北海道のトムラウシへと登った")
print(result)
```
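
The pipeline above returns one prediction per sub-word token. To merge sub-word pieces into whole entity spans, you can pass an `aggregation_strategy` (a standard option of the `token-classification` pipeline); a short sketch:

```python
from transformers import pipeline

classifier = pipeline(
    "token-classification",
    model="tsmatz/xlm-roberta-ner-japanese",
    aggregation_strategy="simple",  # merge consecutive sub-word tokens into one entity
)

for entity in classifier("鈴井は4月の陽気の良い日に、鈴をつけて北海道のトムラウシへと登った"):
    # each aggregated item exposes "entity_group", "word", and "score"
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```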

## Training procedure

You can download the source code for fine-tuning from [here](https://github.com/tsmatz/huggingface-finetune-japanese/blob/master/01-named-entity.ipynb).

### Training hyperparameters

The following hyperparameters were used during training (a rough `TrainingArguments` sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 12
- eval_batch_size: 12
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5

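For orientation, these settings correspond roughly to the `TrainingArguments` below. This is a sketch, not the exact training script (see the notebook linked above); `output_dir` and `evaluation_strategy` are assumptions.

```python
from transformers import TrainingArguments

# Rough reconstruction of the hyperparameters listed above.
# output_dir and evaluation_strategy are assumptions, not taken from the original script.
training_args = TrainingArguments(
    output_dir="xlm-roberta-ner-japanese",
    learning_rate=5e-5,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5,           # Adam betas=(0.9, 0.999), epsilon=1e-8 are the optimizer defaults
    evaluation_strategy="epoch",  # assumed: validation loss/F1 reported once per epoch
)
```
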
### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| No log        | 1.0   | 446  | 0.1510          | 0.8457 |
| No log        | 2.0   | 892  | 0.0626          | 0.9261 |
| No log        | 3.0   | 1338 | 0.0366          | 0.9580 |
| No log        | 4.0   | 1784 | 0.0196          | 0.9792 |
| No log        | 5.0   | 2230 | 0.0173          | 0.9864 |

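The F1 values are reported together with the validation loss at each epoch. For reference, entity-level F1 for NER is commonly computed with the `seqeval` metric from the `evaluate` library; the toy sketch below uses generic BIO-style tags and is not necessarily the exact metric script used for this model.

```python
import evaluate  # requires: pip install evaluate seqeval

# Toy example of entity-level precision/recall/F1 with generic BIO tags
seqeval = evaluate.load("seqeval")

references  = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
predictions = [["O", "B-PER", "I-PER", "O", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"], results["overall_f1"])
```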

### Framework versions

- Transformers 4.23.1
- Pytorch 1.12.1+cu102
- Datasets 2.6.1
- Tokenizers 0.13.1