TTimur committed
Commit ce97018
1 Parent(s): 04d53bc

Update Readme

Files changed (1):
  1. README.md +155 -53
README.md CHANGED
@@ -11,66 +11,168 @@ metrics:
  model-index:
  - name: xlm-roberta-base-kyrgyzNER
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # xlm-roberta-base-kyrgyzNER
-
- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the None dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3229
- - Precision: 0.7042
- - Recall: 0.6871
- - F1: 0.6956
- - Accuracy: 0.9119
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 64
- - eval_batch_size: 64
- - seed: 1
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 800
- - num_epochs: 10
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
- |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
- | No log        | 1.0   | 88   | 3.8135          | 0.0001    | 0.0009 | 0.0002 | 0.0002   |
- | No log        | 2.0   | 176  | 1.3867          | 0.0       | 0.0    | 0.0    | 0.7162   |
- | No log        | 3.0   | 264  | 0.9276          | 0.1684    | 0.0357 | 0.0590 | 0.7678   |
- | No log        | 4.0   | 352  | 0.6467          | 0.4470    | 0.2771 | 0.3421 | 0.8420   |
- | No log        | 5.0   | 440  | 0.5200          | 0.6394    | 0.5282 | 0.5785 | 0.8792   |
- | 1.647         | 6.0   | 528  | 0.4383          | 0.6712    | 0.5918 | 0.6290 | 0.8927   |
- | 1.647         | 7.0   | 616  | 0.3847          | 0.6724    | 0.6439 | 0.6578 | 0.9028   |
- | 1.647         | 8.0   | 704  | 0.3586          | 0.6857    | 0.6575 | 0.6713 | 0.9061   |
- | 1.647         | 9.0   | 792  | 0.3422          | 0.6786    | 0.6717 | 0.6751 | 0.9070   |
- | 1.647         | 10.0  | 880  | 0.3229          | 0.7042    | 0.6871 | 0.6956 | 0.9119   |
-
- ### Framework versions
-
- - Transformers 4.35.2
- - Pytorch 2.1.0+cu121
- - Datasets 2.17.0
- - Tokenizers 0.15.2
  model-index:
  - name: xlm-roberta-base-kyrgyzNER
    results: []
+ language:
+ - ky
  ---

+ # kyrgyzNER model (xlm-roberta-base) by The_Cramer_Project

+ - The original repository: https://github.com/Akyl-AI/KyrgyzNER
+ - The paper will be uploaded soon
+ - The KyrgyzNER dataset and code will be uploaded soon

+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the KyrgyzNER dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.3273
+ - Precision: 0.7090
+ - Recall: 0.6946
+ - F1: 0.7017
+ - Accuracy: 0.9119
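As a quick sanity check on the numbers above, the reported F1 is the harmonic mean of the precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.7090, 0.6946
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7017, matching the reported F1
```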
 
 
+ ## How to use
+ You can use this model with the Transformers pipeline for NER.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoConfig
+ from transformers import pipeline
+
+ id2label = {
+     'LABEL_0': 'B-NATIONAL',
+     'LABEL_1': 'I-PLANT',
+     'LABEL_2': 'I-ORGANISATION',
+     'LABEL_3': 'B-ORGANISATION',
+     'LABEL_4': 'B-MEDIA',
+     'LABEL_5': 'I-ARTIFACT',
+     'LABEL_6': 'B-AWARD',
+     'LABEL_7': 'B-UNKNOWN',
+     'LABEL_8': 'I-LOCATION',
+     'LABEL_9': 'B-PERSON',
+     'LABEL_10': 'I-LEGAL',
+     'LABEL_11': 'B-BUSINESS',
+     'LABEL_12': 'B-ACRONYM',
+     'LABEL_13': 'I-PERIOD',
+     'LABEL_14': 'B-INSTITUTION',
+     'LABEL_15': 'I-MEASURE',
+     'LABEL_16': 'B-CREATION',
+     'LABEL_17': 'I-ACRONYM',
+     'LABEL_18': 'I-AWARD',
+     'LABEL_19': 'I-WEBSITE',
+     'LABEL_20': 'B-PERIOD',
+     'LABEL_21': 'I-PERSON',
+     'LABEL_22': 'I-PERSON_TYPE',
+     'LABEL_23': 'B-SUBSTANCE',
+     'LABEL_24': 'O',
+     'LABEL_25': 'B-PLANT',
+     'LABEL_26': 'I-INSTITUTION',
+     'LABEL_27': 'I-SUBSTANCE',
+     'LABEL_28': 'I-INSTALLATION',
+     'LABEL_29': 'B-CONCEPT',
+     'LABEL_30': 'B-TITLE',
+     'LABEL_31': 'I-EVENT',
+     'LABEL_32': 'B-ARTIFACT',
+     'LABEL_33': 'B-MEASURE',
+     'LABEL_34': 'B-LOCATION',
+     'LABEL_35': 'I-BUSINESS',
+     'LABEL_36': 'B-ANIMAL',
+     'LABEL_37': 'B-PERSON_TYPE',
+     'LABEL_38': 'B-INSTALLATION',
+     'LABEL_39': 'I-TITLE',
+     'LABEL_40': 'B-IDENTIFIER',
+     'LABEL_41': 'I-IDENTIFIER',
+     'LABEL_42': 'B-LEGAL',
+     'LABEL_43': 'I-MEDIA',
+     'LABEL_44': 'I-CONCEPT',
+     'LABEL_45': 'I-UNKNOWN',
+     'LABEL_46': 'B-EVENT',
+     'LABEL_47': 'B-WEBSITE',
+     'LABEL_48': 'I-NATIONAL',
+     'LABEL_49': 'I-CREATION',
+     'LABEL_50': 'I-ANIMAL',
+ }
+
+ model_ckpt = "TTimur/xlm-roberta-base-kyrgyzNER"
+
+ config = AutoConfig.from_pretrained(model_ckpt)
+ tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
+ model = AutoModelForTokenClassification.from_pretrained(model_ckpt, config=config)
+
+ # aggregation_strategy = "none"
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="none")
+
+ example = "Кыргызстан Орто Азиянын түндүк-чыгышында орун алган мамлекет."
+ ner_results = nlp(example)
+ for result in ner_results:
+     result.update({'entity': id2label[result['entity']]})
+     print(result)
+
+ # output:
+ # {'entity': 'B-LOCATION', 'score': 0.95103735, 'index': 1, 'word': '▁Кыргызстан', 'start': 0, 'end': 10}
+ # {'entity': 'B-LOCATION', 'score': 0.79447913, 'index': 2, 'word': '▁Ор', 'start': 11, 'end': 13}
+ # {'entity': 'I-LOCATION', 'score': 0.8703734, 'index': 3, 'word': 'то', 'start': 13, 'end': 15}
+ # {'entity': 'I-LOCATION', 'score': 0.942387, 'index': 4, 'word': '▁Азия', 'start': 16, 'end': 20}
+ # {'entity': 'I-LOCATION', 'score': 0.8542615, 'index': 5, 'word': 'нын', 'start': 20, 'end': 23}
+ # {'entity': 'I-LOCATION', 'score': 0.70930535, 'index': 6, 'word': '▁түн', 'start': 24, 'end': 27}
+ # {'entity': 'I-LOCATION', 'score': 0.6540094, 'index': 7, 'word': 'дүк', 'start': 27, 'end': 30}
+ # {'entity': 'I-LOCATION', 'score': 0.63446337, 'index': 8, 'word': '-', 'start': 30, 'end': 31}
+ # {'entity': 'I-LOCATION', 'score': 0.6204858, 'index': 9, 'word': 'чы', 'start': 31, 'end': 33}
+ # {'entity': 'I-LOCATION', 'score': 0.6786872, 'index': 10, 'word': 'г', 'start': 33, 'end': 34}
+ # {'entity': 'I-LOCATION', 'score': 0.64190257, 'index': 11, 'word': 'ыш', 'start': 34, 'end': 36}
+ # {'entity': 'O', 'score': 0.64438057, 'index': 12, 'word': 'ында', 'start': 36, 'end': 40}
+ # {'entity': 'O', 'score': 0.9916931, 'index': 13, 'word': '▁орун', 'start': 41, 'end': 45}
+ # {'entity': 'O', 'score': 0.9953047, 'index': 14, 'word': '▁алган', 'start': 46, 'end': 51}
+ # {'entity': 'O', 'score': 0.9901377, 'index': 15, 'word': '▁мамлекет', 'start': 52, 'end': 60}
+ # {'entity': 'O', 'score': 0.99605453, 'index': 16, 'word': '.', 'start': 60, 'end': 61}
+
+ # Merge subword pieces back into whole words, keeping the label
+ # of each word's first piece.
+ token = ""
+ label_list = []
+ token_list = []
+
+ for result in ner_results:
+     if result["word"].startswith("▁"):
+         if token:
+             token_list.append(token.replace("▁", ""))
+         token = result["word"]
+         label_list.append(result["entity"])
+     else:
+         token += result["word"]
+
+ token_list.append(token.replace("▁", ""))
+
+ for token, label in zip(token_list, label_list):
+     print(f"{token}\t{label}")
+
+ # output:
+ # Кыргызстан	B-LOCATION
+ # Орто	B-LOCATION
+ # Азиянын	I-LOCATION
+ # түндүк-чыгышында	I-LOCATION
+ # орун	O
+ # алган	O
+ # мамлекет.	O
+
+ # aggregation_strategy = "simple"
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+ example = "Кыргызстан Орто Азиянын түндүк-чыгышында орун алган мамлекет."
+
+ ner_results = nlp(example)
+ for result in ner_results:
+     result.update({'entity_group': id2label[result['entity_group']]})
+     print(result)
+
+ # output:
+ # {'entity_group': 'B-LOCATION', 'score': 0.87275827, 'word': 'Кыргызстан Ор', 'start': 0, 'end': 13}
+ # {'entity_group': 'I-LOCATION', 'score': 0.73398614, 'word': 'то Азиянын түндүк-чыгыш', 'start': 13, 'end': 36}
+ # {'entity_group': 'O', 'score': 0.92351407, 'word': 'ында орун алган мамлекет.', 'start': 36, 'end': 61}
+ ```
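The subword-merging loop in the snippet above can also be factored into a small reusable helper; a minimal sketch (the `merge_subwords` name is ours, and the sample input is abbreviated from the pipeline output shown earlier):

```python
def merge_subwords(ner_results):
    """Merge SentencePiece pieces (new words start with '▁') back into
    whole words, keeping the label of each word's first piece."""
    tokens, labels = [], []
    current = ""
    for result in ner_results:
        if result["word"].startswith("▁"):
            if current:
                tokens.append(current.replace("▁", ""))
            current = result["word"]
            labels.append(result["entity"])
        else:
            current += result["word"]
    if current:
        tokens.append(current.replace("▁", ""))
    return list(zip(tokens, labels))

# Abbreviated sample in the pipeline's output format:
sample = [
    {"word": "▁Кыргызстан", "entity": "B-LOCATION"},
    {"word": "▁Ор", "entity": "B-LOCATION"},
    {"word": "то", "entity": "I-LOCATION"},
    {"word": "▁орун", "entity": "O"},
]
print(merge_subwords(sample))
# [('Кыргызстан', 'B-LOCATION'), ('Орто', 'B-LOCATION'), ('орун', 'O')]
```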
+
+ ## NE classes
+
+ **PERSON**, **LOCATION**, **MEASURE**, **INSTITUTION**, **PERIOD**, **ORGANISATION**, **MEDIA**, **TITLE**, **BUSINESS**, **LEGAL**, **EVENT**, **ARTIFACT**, **INSTALLATION**, **PERSON_TYPE**, **NATIONAL**, **CONCEPT**, **CREATION**, **WEBSITE**, **SUBSTANCE**, **ACRONYM**, **IDENTIFIER**, **UNKNOWN**, **AWARD**, **ANIMAL**
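Note: instead of remapping the `LABEL_i` strings after every call, the mapping can be converted to the integer-keyed form that `transformers` configs expect and passed to `from_pretrained`, so the pipeline emits these tags directly. A sketch with an abbreviated mapping (the full dictionary is in the snippet above):

```python
# Abbreviated string-keyed mapping from the snippet above.
raw = {'LABEL_0': 'B-NATIONAL', 'LABEL_24': 'O', 'LABEL_34': 'B-LOCATION'}

# transformers configs use integer keys, e.g. {34: 'B-LOCATION'}.
id2label = {int(k.split('_')[1]): v for k, v in raw.items()}
label2id = {v: k for k, v in id2label.items()}
print(id2label[34])  # B-LOCATION

# With the full 51-entry mapping, override the generic names at load time:
# model = AutoModelForTokenClassification.from_pretrained(
#     model_ckpt, id2label=id2label, label2id=label2id)
```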