TTimur committed
Commit ce97018
1 Parent(s): 04d53bc

Update Readme

Files changed (1):
  1. README.md +155 -53
README.md CHANGED
@@ -11,66 +11,168 @@ metrics:
  model-index:
  - name: xlm-roberta-base-kyrgyzNER
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # xlm-roberta-base-kyrgyzNER
-
- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the None dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3229
- - Precision: 0.7042
- - Recall: 0.6871
- - F1: 0.6956
- - Accuracy: 0.9119
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 64
- - eval_batch_size: 64
- - seed: 1
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 800
- - num_epochs: 10
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
- |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
- | No log        | 1.0   | 88   | 3.8135          | 0.0001    | 0.0009 | 0.0002 | 0.0002   |
- | No log        | 2.0   | 176  | 1.3867          | 0.0       | 0.0    | 0.0    | 0.7162   |
- | No log        | 3.0   | 264  | 0.9276          | 0.1684    | 0.0357 | 0.0590 | 0.7678   |
- | No log        | 4.0   | 352  | 0.6467          | 0.4470    | 0.2771 | 0.3421 | 0.8420   |
- | No log        | 5.0   | 440  | 0.5200          | 0.6394    | 0.5282 | 0.5785 | 0.8792   |
- | 1.647         | 6.0   | 528  | 0.4383          | 0.6712    | 0.5918 | 0.6290 | 0.8927   |
- | 1.647         | 7.0   | 616  | 0.3847          | 0.6724    | 0.6439 | 0.6578 | 0.9028   |
- | 1.647         | 8.0   | 704  | 0.3586          | 0.6857    | 0.6575 | 0.6713 | 0.9061   |
- | 1.647         | 9.0   | 792  | 0.3422          | 0.6786    | 0.6717 | 0.6751 | 0.9070   |
- | 1.647         | 10.0  | 880  | 0.3229          | 0.7042    | 0.6871 | 0.6956 | 0.9119   |
-
- ### Framework versions
-
- - Transformers 4.35.2
- - Pytorch 2.1.0+cu121
- - Datasets 2.17.0
- - Tokenizers 0.15.2
  model-index:
  - name: xlm-roberta-base-kyrgyzNER
    results: []
+ language:
+ - ky
  ---

+ # kyrgyzNER model (xlm-roberta-base) by The_Cramer_Project

+ - The original repository: https://github.com/Akyl-AI/KyrgyzNER
+ - The paper will be uploaded soon
+ - The KyrgyzNER dataset and code will be uploaded soon

+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the KyrgyzNER dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.3273
+ - Precision: 0.7090
+ - Recall: 0.6946
+ - F1: 0.7017
+ - Accuracy: 0.9119
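As a quick sanity check on the numbers above, the reported F1 is the harmonic mean of the precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.7090, 0.6946
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7017, matching the reported F1
```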
 
 
+ ## How to use
+ You can use this model with the Transformers pipeline for NER.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoConfig
+ from transformers import pipeline
+
+ id2label = {
+     'LABEL_0': 'B-NATIONAL',
+     'LABEL_1': 'I-PLANT',
+     'LABEL_2': 'I-ORGANISATION',
+     'LABEL_3': 'B-ORGANISATION',
+     'LABEL_4': 'B-MEDIA',
+     'LABEL_5': 'I-ARTIFACT',
+     'LABEL_6': 'B-AWARD',
+     'LABEL_7': 'B-UNKNOWN',
+     'LABEL_8': 'I-LOCATION',
+     'LABEL_9': 'B-PERSON',
+     'LABEL_10': 'I-LEGAL',
+     'LABEL_11': 'B-BUSINESS',
+     'LABEL_12': 'B-ACRONYM',
+     'LABEL_13': 'I-PERIOD',
+     'LABEL_14': 'B-INSTITUTION',
+     'LABEL_15': 'I-MEASURE',
+     'LABEL_16': 'B-CREATION',
+     'LABEL_17': 'I-ACRONYM',
+     'LABEL_18': 'I-AWARD',
+     'LABEL_19': 'I-WEBSITE',
+     'LABEL_20': 'B-PERIOD',
+     'LABEL_21': 'I-PERSON',
+     'LABEL_22': 'I-PERSON_TYPE',
+     'LABEL_23': 'B-SUBSTANCE',
+     'LABEL_24': 'O',
+     'LABEL_25': 'B-PLANT',
+     'LABEL_26': 'I-INSTITUTION',
+     'LABEL_27': 'I-SUBSTANCE',
+     'LABEL_28': 'I-INSTALLATION',
+     'LABEL_29': 'B-CONCEPT',
+     'LABEL_30': 'B-TITLE',
+     'LABEL_31': 'I-EVENT',
+     'LABEL_32': 'B-ARTIFACT',
+     'LABEL_33': 'B-MEASURE',
+     'LABEL_34': 'B-LOCATION',
+     'LABEL_35': 'I-BUSINESS',
+     'LABEL_36': 'B-ANIMAL',
+     'LABEL_37': 'B-PERSON_TYPE',
+     'LABEL_38': 'B-INSTALLATION',
+     'LABEL_39': 'I-TITLE',
+     'LABEL_40': 'B-IDENTIFIER',
+     'LABEL_41': 'I-IDENTIFIER',
+     'LABEL_42': 'B-LEGAL',
+     'LABEL_43': 'I-MEDIA',
+     'LABEL_44': 'I-CONCEPT',
+     'LABEL_45': 'I-UNKNOWN',
+     'LABEL_46': 'B-EVENT',
+     'LABEL_47': 'B-WEBSITE',
+     'LABEL_48': 'I-NATIONAL',
+     'LABEL_49': 'I-CREATION',
+     'LABEL_50': 'I-ANIMAL',
+ }
+
+ model_ckpt = "TTimur/xlm-roberta-base-kyrgyzNER"
+
+ config = AutoConfig.from_pretrained(model_ckpt)
+ tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
+ model = AutoModelForTokenClassification.from_pretrained(model_ckpt, config=config)
+
+ # aggregation_strategy = "none"
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="none")
+
+ example = "Кыргызстан Орто Азиянын түндүк-чыгышында орун алган мамлекет."
+ ner_results = nlp(example)
+ for result in ner_results:
+     result.update({'entity': id2label[result['entity']]})
+     print(result)
+
+ # output:
+ # {'entity': 'B-LOCATION', 'score': 0.95103735, 'index': 1, 'word': '▁Кыргызстан', 'start': 0, 'end': 10}
+ # {'entity': 'B-LOCATION', 'score': 0.79447913, 'index': 2, 'word': '▁Ор', 'start': 11, 'end': 13}
+ # {'entity': 'I-LOCATION', 'score': 0.8703734, 'index': 3, 'word': 'то', 'start': 13, 'end': 15}
+ # {'entity': 'I-LOCATION', 'score': 0.942387, 'index': 4, 'word': '▁Азия', 'start': 16, 'end': 20}
+ # {'entity': 'I-LOCATION', 'score': 0.8542615, 'index': 5, 'word': 'нын', 'start': 20, 'end': 23}
+ # {'entity': 'I-LOCATION', 'score': 0.70930535, 'index': 6, 'word': '▁түн', 'start': 24, 'end': 27}
+ # {'entity': 'I-LOCATION', 'score': 0.6540094, 'index': 7, 'word': 'дүк', 'start': 27, 'end': 30}
+ # {'entity': 'I-LOCATION', 'score': 0.63446337, 'index': 8, 'word': '-', 'start': 30, 'end': 31}
+ # {'entity': 'I-LOCATION', 'score': 0.6204858, 'index': 9, 'word': 'чы', 'start': 31, 'end': 33}
+ # {'entity': 'I-LOCATION', 'score': 0.6786872, 'index': 10, 'word': 'г', 'start': 33, 'end': 34}
+ # {'entity': 'I-LOCATION', 'score': 0.64190257, 'index': 11, 'word': 'ыш', 'start': 34, 'end': 36}
+ # {'entity': 'O', 'score': 0.64438057, 'index': 12, 'word': 'ында', 'start': 36, 'end': 40}
+ # {'entity': 'O', 'score': 0.9916931, 'index': 13, 'word': '▁орун', 'start': 41, 'end': 45}
+ # {'entity': 'O', 'score': 0.9953047, 'index': 14, 'word': '▁алган', 'start': 46, 'end': 51}
+ # {'entity': 'O', 'score': 0.9901377, 'index': 15, 'word': '▁мамлекет', 'start': 52, 'end': 60}
+ # {'entity': 'O', 'score': 0.99605453, 'index': 16, 'word': '.', 'start': 60, 'end': 61}
+
+ # Merge subword pieces back into whole words, keeping the label
+ # of each word's first piece.
+ token = ""
+ label_list = []
+ token_list = []
+
+ for result in ner_results:
+     if result["word"].startswith("▁"):
+         if token:
+             token_list.append(token.replace("▁", ""))
+         token = result["word"]
+         label_list.append(result["entity"])
+     else:
+         token += result["word"]
+
+ token_list.append(token.replace("▁", ""))
+
+ for token, label in zip(token_list, label_list):
+     print(f"{token}\t{label}")
+
+ # output:
+ # Кыргызстан	B-LOCATION
+ # Орто	B-LOCATION
+ # Азиянын	I-LOCATION
+ # түндүк-чыгышында	I-LOCATION
+ # орун	O
+ # алган	O
+ # мамлекет.	O
+
+ # aggregation_strategy = "simple"
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+ example = "Кыргызстан Орто Азиянын түндүк-чыгышында орун алган мамлекет."
+
+ ner_results = nlp(example)
+ for result in ner_results:
+     result.update({'entity_group': id2label[result['entity_group']]})
+     print(result)
+
+ # output:
+ # {'entity_group': 'B-LOCATION', 'score': 0.87275827, 'word': 'Кыргызстан Ор', 'start': 0, 'end': 13}
+ # {'entity_group': 'I-LOCATION', 'score': 0.73398614, 'word': 'то Азиянын түндүк-чыгыш', 'start': 13, 'end': 36}
+ # {'entity_group': 'O', 'score': 0.92351407, 'word': 'ында орун алган мамлекет.', 'start': 36, 'end': 61}
+ ```
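The subword-merging loop in the snippet above can also be factored into a small reusable helper; a minimal sketch (the `merge_subwords` name is ours, and the sample input is abbreviated from the pipeline output shown earlier):

```python
def merge_subwords(ner_results):
    """Merge SentencePiece pieces (new words start with '▁') back into
    whole words, keeping the label of each word's first piece."""
    tokens, labels = [], []
    current = ""
    for result in ner_results:
        if result["word"].startswith("▁"):
            if current:
                tokens.append(current.replace("▁", ""))
            current = result["word"]
            labels.append(result["entity"])
        else:
            current += result["word"]
    if current:
        tokens.append(current.replace("▁", ""))
    return list(zip(tokens, labels))

# Abbreviated sample in the pipeline's output format:
sample = [
    {"word": "▁Кыргызстан", "entity": "B-LOCATION"},
    {"word": "▁Ор", "entity": "B-LOCATION"},
    {"word": "то", "entity": "I-LOCATION"},
    {"word": "▁орун", "entity": "O"},
]
print(merge_subwords(sample))
# [('Кыргызстан', 'B-LOCATION'), ('Орто', 'B-LOCATION'), ('орун', 'O')]
```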
+
+ ## NE classes
+
+ **PERSON**, **LOCATION**, **MEASURE**, **INSTITUTION**, **PERIOD**, **ORGANISATION**, **MEDIA**, **TITLE**, **BUSINESS**, **LEGAL**, **EVENT**, **ARTIFACT**, **INSTALLATION**, **PERSON_TYPE**, **NATIONAL**, **CONCEPT**, **CREATION**, **WEBSITE**, **SUBSTANCE**, **ACRONYM**, **IDENTIFIER**, **UNKNOWN**, **AWARD**, **ANIMAL**
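Note: instead of remapping the `LABEL_i` strings after every call, the mapping can be converted to the integer-keyed form that `transformers` configs expect and passed to `from_pretrained`, so the pipeline emits these tags directly. A sketch with an abbreviated mapping (the full dictionary is in the snippet above):

```python
# Abbreviated string-keyed mapping from the snippet above.
raw = {'LABEL_0': 'B-NATIONAL', 'LABEL_24': 'O', 'LABEL_34': 'B-LOCATION'}

# transformers configs use integer keys, e.g. {34: 'B-LOCATION'}.
id2label = {int(k.split('_')[1]): v for k, v in raw.items()}
label2id = {v: k for k, v in id2label.items()}
print(id2label[34])  # B-LOCATION

# With the full 51-entry mapping, override the generic names at load time:
# model = AutoModelForTokenClassification.from_pretrained(
#     model_ckpt, id2label=id2label, label2id=label2id)
```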