model-index:
- name: xlm-roberta-base-kyrgyzNER
  results: []
language:
- ky
---

# kyrgyzNER model (xlm-roberta-base) by The_Cramer_Project
- The original repository: https://github.com/Akyl-AI/KyrgyzNER
- Paper will be uploaded soon
- The KyrgyzNER dataset and code will be uploaded soon
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the KyrgyzNER dataset.
It achieves the following results on the evaluation set:

- Loss: 0.3273
- Precision: 0.7090
- Recall: 0.6946
- F1: 0.7017
- Accuracy: 0.9119
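As a quick sanity check on the numbers above, F1 is the harmonic mean of precision and recall:

```python
# F1 as the harmonic mean of the reported precision and recall
precision = 0.7090
recall = 0.6946

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7017, matching the reported F1
```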
## How to use

You can use this model with the Transformers pipeline for NER.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoConfig
from transformers import pipeline

# Map the generic LABEL_n names stored in the checkpoint to their BIO tags.
id2label = {
    'LABEL_0': 'B-NATIONAL',
    'LABEL_1': 'I-PLANT',
    'LABEL_2': 'I-ORGANISATION',
    'LABEL_3': 'B-ORGANISATION',
    'LABEL_4': 'B-MEDIA',
    'LABEL_5': 'I-ARTIFACT',
    'LABEL_6': 'B-AWARD',
    'LABEL_7': 'B-UNKNOWN',
    'LABEL_8': 'I-LOCATION',
    'LABEL_9': 'B-PERSON',
    'LABEL_10': 'I-LEGAL',
    'LABEL_11': 'B-BUSINESS',
    'LABEL_12': 'B-ACRONYM',
    'LABEL_13': 'I-PERIOD',
    'LABEL_14': 'B-INSTITUTION',
    'LABEL_15': 'I-MEASURE',
    'LABEL_16': 'B-CREATION',
    'LABEL_17': 'I-ACRONYM',
    'LABEL_18': 'I-AWARD',
    'LABEL_19': 'I-WEBSITE',
    'LABEL_20': 'B-PERIOD',
    'LABEL_21': 'I-PERSON',
    'LABEL_22': 'I-PERSON_TYPE',
    'LABEL_23': 'B-SUBSTANCE',
    'LABEL_24': 'O',
    'LABEL_25': 'B-PLANT',
    'LABEL_26': 'I-INSTITUTION',
    'LABEL_27': 'I-SUBSTANCE',
    'LABEL_28': 'I-INSTALLATION',
    'LABEL_29': 'B-CONCEPT',
    'LABEL_30': 'B-TITLE',
    'LABEL_31': 'I-EVENT',
    'LABEL_32': 'B-ARTIFACT',
    'LABEL_33': 'B-MEASURE',
    'LABEL_34': 'B-LOCATION',
    'LABEL_35': 'I-BUSINESS',
    'LABEL_36': 'B-ANIMAL',
    'LABEL_37': 'B-PERSON_TYPE',
    'LABEL_38': 'B-INSTALLATION',
    'LABEL_39': 'I-TITLE',
    'LABEL_40': 'B-IDENTIFIER',
    'LABEL_41': 'I-IDENTIFIER',
    'LABEL_42': 'B-LEGAL',
    'LABEL_43': 'I-MEDIA',
    'LABEL_44': 'I-CONCEPT',
    'LABEL_45': 'I-UNKNOWN',
    'LABEL_46': 'B-EVENT',
    'LABEL_47': 'B-WEBSITE',
    'LABEL_48': 'I-NATIONAL',
    'LABEL_49': 'I-CREATION',
    'LABEL_50': 'I-ANIMAL',
}

model_ckpt = "TTimur/xlm-roberta-base-kyrgyzNER"

config = AutoConfig.from_pretrained(model_ckpt)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(model_ckpt, config=config)

# aggregation_strategy="none": one prediction per subword token
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="none")

example = "Кыргызстан Орто Азиянын түндүк-чыгышында орун алган мамлекет."
ner_results = nlp(example)
for result in ner_results:
    result.update({'entity': id2label[result['entity']]})
    print(result)

# output:
# {'entity': 'B-LOCATION', 'score': 0.95103735, 'index': 1, 'word': '▁Кыргызстан', 'start': 0, 'end': 10}
# {'entity': 'B-LOCATION', 'score': 0.79447913, 'index': 2, 'word': '▁Ор', 'start': 11, 'end': 13}
# {'entity': 'I-LOCATION', 'score': 0.8703734, 'index': 3, 'word': 'то', 'start': 13, 'end': 15}
# {'entity': 'I-LOCATION', 'score': 0.942387, 'index': 4, 'word': '▁Азия', 'start': 16, 'end': 20}
# {'entity': 'I-LOCATION', 'score': 0.8542615, 'index': 5, 'word': 'нын', 'start': 20, 'end': 23}
# {'entity': 'I-LOCATION', 'score': 0.70930535, 'index': 6, 'word': '▁түн', 'start': 24, 'end': 27}
# {'entity': 'I-LOCATION', 'score': 0.6540094, 'index': 7, 'word': 'дүк', 'start': 27, 'end': 30}
# {'entity': 'I-LOCATION', 'score': 0.63446337, 'index': 8, 'word': '-', 'start': 30, 'end': 31}
# {'entity': 'I-LOCATION', 'score': 0.6204858, 'index': 9, 'word': 'чы', 'start': 31, 'end': 33}
# {'entity': 'I-LOCATION', 'score': 0.6786872, 'index': 10, 'word': 'г', 'start': 33, 'end': 34}
# {'entity': 'I-LOCATION', 'score': 0.64190257, 'index': 11, 'word': 'ыш', 'start': 34, 'end': 36}
# {'entity': 'O', 'score': 0.64438057, 'index': 12, 'word': 'ында', 'start': 36, 'end': 40}
# {'entity': 'O', 'score': 0.9916931, 'index': 13, 'word': '▁орун', 'start': 41, 'end': 45}
# {'entity': 'O', 'score': 0.9953047, 'index': 14, 'word': '▁алган', 'start': 46, 'end': 51}
# {'entity': 'O', 'score': 0.9901377, 'index': 15, 'word': '▁мамлекет', 'start': 52, 'end': 60}
# {'entity': 'O', 'score': 0.99605453, 'index': 16, 'word': '.', 'start': 60, 'end': 61}

# Merge subword tokens back into words: the SentencePiece tokenizer marks the
# start of each word with "▁", and each word keeps its first subword's label.
token = ""
label_list = []
token_list = []

for result in ner_results:
    if result["word"].startswith("▁"):
        if token:
            token_list.append(token.replace("▁", ""))
        token = result["word"]
        label_list.append(result["entity"])
    else:
        token += result["word"]

token_list.append(token.replace("▁", ""))

for token, label in zip(token_list, label_list):
    print(f"{token}\t{label}")

# output:
# Кыргызстан B-LOCATION
# Орто B-LOCATION
# Азиянын I-LOCATION
# түндүк-чыгышында I-LOCATION
# орун O
# алган O
# мамлекет. O

# aggregation_strategy="simple": the pipeline groups consecutive tokens
# that share a label into entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "Кыргызстан Орто Азиянын түндүк-чыгышында орун алган мамлекет."

ner_results = nlp(example)
for result in ner_results:
    result.update({'entity_group': id2label[result['entity_group']]})
    print(result)

# output:
# {'entity_group': 'B-LOCATION', 'score': 0.87275827, 'word': 'Кыргызстан Ор', 'start': 0, 'end': 13}
# {'entity_group': 'I-LOCATION', 'score': 0.73398614, 'word': 'то Азиянын түндүк-чыгыш', 'start': 13, 'end': 36}
# {'entity_group': 'O', 'score': 0.92351407, 'word': 'ында орун алган мамлекет.', 'start': 36, 'end': 61}
```
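The word-level (token, label) pairs produced above can be decoded further into complete entity spans. A minimal BIO-decoding sketch; the function name `bio_to_spans` is illustrative, not part of the repository:

```python
def bio_to_spans(tokens, labels):
    """Group (token, BIO-label) pairs into (entity_type, text) spans.

    A "B-" label starts a new entity, a matching "I-" label continues
    the current one, and "O" (or an inconsistent "I-") closes it.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Word-level predictions from the example sentence above:
tokens = ["Кыргызстан", "Орто", "Азиянын", "түндүк-чыгышында", "орун", "алган", "мамлекет."]
labels = ["B-LOCATION", "B-LOCATION", "I-LOCATION", "I-LOCATION", "O", "O", "O"]
print(bio_to_spans(tokens, labels))
# [('LOCATION', 'Кыргызстан'), ('LOCATION', 'Орто Азиянын түндүк-чыгышында')]
```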
## NE classes

**PERSON**, **LOCATION**, **MEASURE**, **INSTITUTION**, **PERIOD**, **ORGANISATION**, **MEDIA**, **TITLE**, **BUSINESS**, **LEGAL**, **EVENT**, **ARTIFACT**, **INSTALLATION**, **PERSON_TYPE**, **NATIONAL**, **CONCEPT**, **CREATION**, **WEBSITE**, **SUBSTANCE**, **ACRONYM**, **IDENTIFIER**, **UNKNOWN**, **AWARD**, **ANIMAL**, **PLANT**
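These classes relate to the 51 entries of `id2label` through the BIO scheme: each class gets a B- and an I- tag, plus the single O tag. A quick consistency check:

```python
# Reconstruct the BIO label inventory from the entity classes
classes = [
    "PERSON", "LOCATION", "MEASURE", "INSTITUTION", "PERIOD", "ORGANISATION",
    "MEDIA", "TITLE", "BUSINESS", "LEGAL", "EVENT", "ARTIFACT", "INSTALLATION",
    "PERSON_TYPE", "NATIONAL", "CONCEPT", "CREATION", "WEBSITE", "SUBSTANCE",
    "ACRONYM", "IDENTIFIER", "UNKNOWN", "AWARD", "ANIMAL", "PLANT",
]
labels = ["O"] + [f"{prefix}-{c}" for c in classes for prefix in ("B", "I")]
print(len(labels))  # 51: 25 classes x 2 BIO prefixes + "O"
```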