---
language:
- ar
- az
- bg
- de
- el
- en
- es
- fr
- hi
- it
- ja
- nl
- pl
- pt
- ru
- sw
- th
- tr
- ur
- vi
- zh
license: cc-by-nc-4.0
tags:
- language detect
pipeline_tag: text-classification
widget:
- text: "Əlqasım oğulları vorzakondu"
---

# Multilingual Language Detection Model

## Model Description
This repository contains a multilingual language detection model based on the XLM-RoBERTa base architecture. It distinguishes between 21 languages: Arabic, Azerbaijani, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.

## How to Use
You can use this model directly with a `pipeline` for text classification, or load the tokenizer and model yourself with the `transformers` library for more control, as shown in the Quick Start below.
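
A minimal pipeline sketch (the output labels are assumed to come back as the generic `LABEL_0` ... `LABEL_20` identifiers listed in the table further below):

```python
from transformers import pipeline

# Text-classification pipeline; downloads the model and tokenizer on first use
detector = pipeline("text-classification", model="LocalDoc/language_detection")

result = detector("Əlqasım oğulları vorzakondu")
print(result)  # e.g. [{'label': 'LABEL_0', 'score': 0.99}]
```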

### Quick Start
First, install the required libraries if you haven't already (`sentencepiece` is needed by the XLM-RoBERTa tokenizer):
```bash
pip install transformers torch sentencepiece
```

```python
from transformers import AutoModelForSequenceClassification, XLMRobertaTokenizer
import torch

# Load tokenizer and model
tokenizer = XLMRobertaTokenizer.from_pretrained("LocalDoc/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/language_detection")

# Prepare text
text = "Əlqasım oğulları vorzakondu"
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

# Prediction
model.eval()
with torch.no_grad():
    outputs = model(**encoded_input)

# Process the outputs
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_class_index = probabilities.argmax().item()
labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
predicted_label = labels[predicted_class_index]
print(f"Predicted Language: {predicted_label}")
```
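
If you need the scores for every language rather than just the top prediction, the same tensors can be expanded; a small follow-up sketch reusing `probabilities` and `labels` from the example above:

```python
# Rank all 21 languages by predicted probability (reuses `probabilities` and `labels` from above)
scores = probabilities.squeeze().tolist()
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked[:5]:
    print(f"{label}: {score:.4f}")
```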

## Language Label Information

The model outputs a label for each prediction, corresponding to one of the languages listed below. Each label is associated with a specific language code as detailed in the following table:

| Label | Language Code | Language Name |
|-------|---------------|---------------|
| LABEL_0     | az            | Azerbaijani   |
| LABEL_1     | ar            | Arabic        |
| LABEL_2     | bg            | Bulgarian     |
| LABEL_3     | de            | German        |
| LABEL_4     | el            | Greek         |
| LABEL_5     | en            | English       |
| LABEL_6     | es            | Spanish       |
| LABEL_7     | fr            | French        |
| LABEL_8     | hi            | Hindi         |
| LABEL_9     | it            | Italian       |
| LABEL_10    | ja            | Japanese      |
| LABEL_11    | nl            | Dutch         |
| LABEL_12    | pl            | Polish        |
| LABEL_13    | pt            | Portuguese    |
| LABEL_14    | ru            | Russian       |
| LABEL_15    | sw            | Swahili       |
| LABEL_16    | th            | Thai          |
| LABEL_17    | tr            | Turkish       |
| LABEL_18    | ur            | Urdu          |
| LABEL_19    | vi            | Vietnamese    |
| LABEL_20    | zh            | Chinese       |

Use this mapping to decode the model's predictions into language codes and names for further processing or analysis.
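
If you use the pipeline route shown above, the raw `LABEL_N` strings can be decoded with a small lookup built from this table; a minimal sketch (the `prediction` dict mirrors the pipeline's output shape):

```python
# Maps the model's raw output labels to ISO language codes (taken from the table above)
LABEL_TO_CODE = {
    "LABEL_0": "az", "LABEL_1": "ar", "LABEL_2": "bg", "LABEL_3": "de",
    "LABEL_4": "el", "LABEL_5": "en", "LABEL_6": "es", "LABEL_7": "fr",
    "LABEL_8": "hi", "LABEL_9": "it", "LABEL_10": "ja", "LABEL_11": "nl",
    "LABEL_12": "pl", "LABEL_13": "pt", "LABEL_14": "ru", "LABEL_15": "sw",
    "LABEL_16": "th", "LABEL_17": "tr", "LABEL_18": "ur", "LABEL_19": "vi",
    "LABEL_20": "zh",
}

prediction = {"label": "LABEL_0", "score": 0.998}  # example pipeline output shape
print(LABEL_TO_CODE[prediction["label"]])  # -> "az"
```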


## Training Performance

The model was trained for three epochs, showing consistent improvement in validation loss, accuracy, and F1 score:

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Score |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.0127        | 0.0174          | 0.9966   | 0.9966   |
| 2     | 0.0149        | 0.0141          | 0.9973   | 0.9973   |
| 3     | 0.0001        | 0.0109          | 0.9984   | 0.9984   |

## Test Results

The model achieved the following results on the test set:

| Metric             | Value   |
|--------------------|---------|
| Loss               | 0.0133  |
| Accuracy           | 0.9975  |
| F1 Score           | 0.9975  |
| Precision          | 0.9975  |
| Recall             | 0.9975  |
| Evaluation Time    | 17.5 s  |
| Samples per Second | 599.685 |
| Steps per Second   | 9.424   |


## License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You may share and adapt it with attribution to the source, but commercial use is prohibited.



## Contact Information

If you have any questions or suggestions, please contact us at [[email protected]].