---
language: 
- az
tags: 
- token-classification
- ner
- roberta
- multilingual
license: mit
datasets:
- LocalDoc/azerbaijani-ner-dataset
metrics:
- precision
- recall
- f1
model-index:
- name: XLM-RoBERTa Azerbaijani NER Model
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: Azerbaijani NER Dataset
      type: LocalDoc/azerbaijani-ner-dataset
    metrics:
      - name: Precision
        type: precision
        value: 0.764390
      - name: Recall
        type: recall
        value: 0.740460
      - name: F1
        type: f1
        value: 0.752235
---



# XLM-RoBERTa Azerbaijani NER Model

[![Hugging Face Model](https://img.shields.io/badge/Hugging%20Face-Model-blue)](https://huggingface.co/IsmatS/xlm-roberta-az-ner)

This model is a fine-tuned version of **XLM-RoBERTa** for Named Entity Recognition (NER) in Azerbaijani. It recognizes the entity types commonly found in Azerbaijani text, such as personal names, locations, organizations, and dates; detailed evaluation scores are reported under Performance Metrics below.

## Model Details

- **Base Model**: `xlm-roberta-base`
- **Fine-tuned on**: [Azerbaijani Named Entity Recognition Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset)
- **Task**: Named Entity Recognition (NER)
- **Language**: Azerbaijani (az)
- **Dataset**: Custom Azerbaijani NER dataset with 25 entity types, including `PERSON`, `LOCATION`, `ORGANISATION`, and `DATE`

### Data Source

The model was trained on the [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset), which provides annotated data with 25 distinct entity types for the Azerbaijani language. The dataset supports Azerbaijani NLP tasks such as entity recognition and language understanding.

### Entity Types
The model recognizes the following entities:
- **PERSON**: Names of people
- **LOCATION**: Geographical locations
- **ORGANISATION**: Companies, institutions
- **DATE**: Dates and periods
- **MONEY**: Monetary values
- **TIME**: Time expressions
- **GPE**: Countries, cities, states
- **FACILITY**: Buildings, landmarks, etc.
- **EVENT**: Events and occurrences
- **...and more**

For the full list of entities, please refer to the dataset description.
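
The exact label strings this checkpoint emits can also be read from its configuration. Below is a minimal sketch for inspecting them; the labels shown in the comment are illustrative, since the IOB prefixes depend on how the checkpoint was exported.

```python
from transformers import AutoConfig

# Load only the configuration to inspect the label set without downloading the model weights
config = AutoConfig.from_pretrained("IsmatS/xlm-roberta-az-ner")
print(config.id2label)  # class index -> label string, e.g. {0: "O", 1: "B-PERSON", ...}
```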

## Performance Metrics

### Epoch-wise Performance

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1     |
|-------|---------------|-----------------|-----------|--------|--------|
| 1     | 0.323100      | 0.275503        | 0.775799  | 0.694886 | 0.733117 |
| 2     | 0.272500      | 0.262481        | 0.739266  | 0.739900 | 0.739583 |
| 3     | 0.248600      | 0.252498        | 0.751478  | 0.741152 | 0.746280 |
| 4     | 0.236800      | 0.249968        | 0.754882  | 0.741449 | 0.748105 |
| 5     | 0.223800      | 0.252187        | 0.764390  | 0.740460 | 0.752235 |
| 6     | 0.218600      | 0.249887        | 0.756352  | 0.741646 | 0.748927 |
| 7     | 0.209700      | 0.250748        | 0.760696  | 0.739438 | 0.749916 |

### Detailed Classification Report (Epoch 7)

This table summarizes the precision, recall, and F1-score for each entity type, calculated on the validation dataset.

| Entity Type    | Precision | Recall | F1-Score | Support |
|----------------|-----------|--------|----------|---------|
| ART            | 0.54      | 0.20   | 0.29     | 1857    |
| DATE           | 0.52      | 0.47   | 0.50     | 880     |
| EVENT          | 0.69      | 0.35   | 0.47     | 96      |
| FACILITY       | 0.69      | 0.69   | 0.69     | 1170    |
| LAW            | 0.60      | 0.61   | 0.60     | 1122    |
| LOCATION       | 0.77      | 0.82   | 0.80     | 9132    |
| MONEY          | 0.61      | 0.57   | 0.59     | 540     |
| ORGANISATION   | 0.69      | 0.68   | 0.69     | 544     |
| PERCENTAGE     | 0.79      | 0.82   | 0.81     | 3591    |
| PERSON         | 0.87      | 0.83   | 0.85     | 7037    |
| PRODUCT        | 0.83      | 0.85   | 0.84     | 2808    |
| TIME           | 0.55      | 0.51   | 0.53     | 1569    |

**Overall Metrics**:
- **Micro Average**: Precision = 0.76, Recall = 0.74, F1-Score = 0.75
- **Macro Average**: Precision = 0.68, Recall = 0.62, F1-Score = 0.64
- **Weighted Average**: Precision = 0.75, Recall = 0.74, F1-Score = 0.74
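
Per-entity reports of this kind are commonly produced with the `seqeval` library over IOB-tagged sequences. The snippet below is a minimal sketch on toy data, not the exact evaluation script used for this model:

```python
# pip install seqeval
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted tag sequences in IOB2 format
y_true = [["B-PERSON", "I-PERSON", "O", "B-LOCATION", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "O", "O"]]

print(f1_score(y_true, y_pred))               # entity-level F1
print(classification_report(y_true, y_pred))  # per-entity precision / recall / F1
```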

## Usage

You can use this model with the Hugging Face `transformers` library to perform NER on Azerbaijani text. Here’s an example:

### Installation

Make sure you have the `transformers` library and a PyTorch backend installed:

```bash
pip install transformers torch
```

### Inference Example

Load the model and tokenizer, then run the NER pipeline on Azerbaijani text:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
model_name = "IsmatS/xlm-roberta-az-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Set up the NER pipeline
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example sentence
sentence = "Bakı şəhərində Azərbaycan Respublikasının prezidenti İlham Əliyev."
entities = nlp_ner(sentence)

# Display entities
for entity in entities:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']}")
```

### Sample Output
```json
[
    {
        "entity_group": "LOCATION",
        "score": 0.98,
        "word": "Bakı",
        "start": 0,
        "end": 4
    },
    {
        "entity_group": "PERSON",
        "score": 0.99,
        "word": "İlham Əliyev",
        "start": 53,
        "end": 65
    }
]
```
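
The scores in the sample above are illustrative, and the pipeline returns some numeric fields as NumPy scalars, so a small conversion step is needed before serializing to JSON. A sketch, continuing from the `entities` variable of the inference example:

```python
import json

# Convert NumPy scalar fields to plain Python types so the result can be JSON-serialized
serializable = [
    {
        "entity_group": e["entity_group"],
        "score": round(float(e["score"]), 2),
        "word": e["word"],
        "start": int(e["start"]),
        "end": int(e["end"]),
    }
    for e in entities
]
print(json.dumps(serializable, ensure_ascii=False, indent=4))
```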

## Training Details

- **Training Data**: This model was fine-tuned on the [Azerbaijani NER Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset) with 25 entity types.
- **Training Framework**: Hugging Face `transformers`
- **Optimizer**: AdamW
- **Epochs**: 8
- **Batch Size**: 64
- **Evaluation Metric**: F1-score
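
The exact training script is not published with this card, but a comparable setup with the Hugging Face `Trainer` would look roughly like the sketch below. The learning rate, dataset preprocessing, and metrics function are placeholders and assumptions, not values taken from the card; `Trainer` uses AdamW by default, which matches the optimizer listed above.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Placeholders: pre-tokenized, label-aligned splits of the Azerbaijani NER dataset (not shown here)
train_dataset, eval_dataset, label_list = ..., ..., ...

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(label_list)
)

training_args = TrainingArguments(
    output_dir="xlm-roberta-az-ner",
    learning_rate=2e-5,                  # assumed; not stated in the card
    per_device_train_batch_size=64,      # batch size from the card
    num_train_epochs=8,                  # epochs from the card
    evaluation_strategy="epoch",         # newer transformers versions call this `eval_strategy`
    save_strategy="epoch",
    metric_for_best_model="f1",          # the card's evaluation metric
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,                         # AdamW optimizer is the Trainer default
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=...,                 # placeholder: e.g. seqeval-based precision/recall/F1
)
trainer.train()
```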

## Limitations

- The model is trained specifically for the Azerbaijani language and may not generalize well to other languages.
- Certain rare entities may be misclassified due to limited training data in those categories.

## Citation

If you use this model in your research or application, please consider citing:

```bibtex
@misc{ismats_az_ner_2024,
  title={XLM-RoBERTa Azerbaijani NER Model},
  author={Ismat Samadov},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/IsmatS/xlm-roberta-az-ner}
}
```

## License

This model is available under the [MIT License](https://opensource.org/licenses/MIT).