ai-forever
commited on
Commit
•
13d233d
1
Parent(s):
c8d1248
Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
@@ -19,18 +19,45 @@ Turkmen belongs to Turkic language family. It's a very energetic language with a
|
|
19 |
2. It uses the Latin script, having switched from the Cyrillic script after Turkmenistan gained independence.
|
20 |
3. The language has been influenced by other languages like Russian and Persian due to historical interactions.
|
21 |
|
22 |
-
##
|
23 |
|
24 |
-
|
25 |
|
26 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
27 |
|
28 |
-
|
29 |
|
30 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
31 |
|
32 |
-
|
33 |
|
34 |
-
|
35 |
|
36 |
Model will be improved over time. Stay tuned!
|
|
|
19 |
2. It uses the Latin script, having switched from the Cyrillic script after Turkmenistan gained independence.
|
20 |
3. The language has been influenced by other languages like Russian and Persian due to historical interactions.
|
21 |
|
22 |
+
## Technical details
|
23 |
|
24 |
+
It's one of the models derived from the base [mGPT-XL (1.3B)](https://huggingface.co/ai-forever/mGPT) model (see the list below) which was originally trained on the 61 languages from 25 language families using Wikipedia and C4 corpus.
|
25 |
|
26 |
+
We've found additional data for 23 languages most of which are considered as minor and decided to further tune the base model. **Turkmen mGPT 1.3B** was trained for another 200 steps with batch_size=4 and context window of **2048** tokens on 1 A100.
|
27 |
+
|
28 |
+
Final perplexity for this model on validation is **28.47**.
|
29 |
+
|
30 |
+
_Chart of the training loss and perplexity:_
|
31 |
+
|
32 |
+
![](https://i.imgur.com/E8K4ODA.png)
|
33 |
|
34 |
+
## Other mGPT-1.3B models
|
35 |
|
36 |
+
- [mGPT-1.3B-armenian](https://huggingface.co/ai-forever/mGPT-1.3B-armenian)
|
37 |
+
- [mGPT-1.3B-azerbaijan](https://huggingface.co/ai-forever/mGPT-1.3B-azerbaijan)
|
38 |
+
- [mGPT-1.3B-bashkir](https://huggingface.co/ai-forever/mGPT-1.3B-bashkir)
|
39 |
+
- [mGPT-1.3B-belorussian](https://huggingface.co/ai-forever/mGPT-1.3B-belorussian)
|
40 |
+
- [mGPT-1.3B-bulgarian](https://huggingface.co/ai-forever/mGPT-1.3B-bulgarian)
|
41 |
+
- [mGPT-1.3B-buryat](https://huggingface.co/ai-forever/mGPT-1.3B-buryat)
|
42 |
+
- [mGPT-1.3B-chuvash](https://huggingface.co/ai-forever/mGPT-1.3B-chuvash)
|
43 |
+
- [mGPT-1.3B-georgian](https://huggingface.co/ai-forever/mGPT-1.3B-georgian)
|
44 |
+
- [mGPT-1.3B-kalmyk](https://huggingface.co/ai-forever/mGPT-1.3B-kalmyk)
|
45 |
+
- [mGPT-1.3B-kazakh](https://huggingface.co/ai-forever/mGPT-1.3B-kazakh)
|
46 |
+
- [mGPT-1.3B-kirgiz](https://huggingface.co/ai-forever/mGPT-1.3B-kirgiz)
|
47 |
+
- [mGPT-1.3B-mari](https://huggingface.co/ai-forever/mGPT-1.3B-mari)
|
48 |
+
- [mGPT-1.3B-mongol](https://huggingface.co/ai-forever/mGPT-1.3B-mongol)
|
49 |
+
- [mGPT-1.3B-ossetian](https://huggingface.co/ai-forever/mGPT-1.3B-ossetian)
|
50 |
+
- [mGPT-1.3B-persian](https://huggingface.co/ai-forever/mGPT-1.3B-persian)
|
51 |
+
- [mGPT-1.3B-romanian](https://huggingface.co/ai-forever/mGPT-1.3B-romanian)
|
52 |
+
- [mGPT-1.3B-tajik](https://huggingface.co/ai-forever/mGPT-1.3B-tajik)
|
53 |
+
- [mGPT-1.3B-tatar](https://huggingface.co/ai-forever/mGPT-1.3B-tatar)
|
54 |
+
- [mGPT-1.3B-tuvan](https://huggingface.co/ai-forever/mGPT-1.3B-tuvan)
|
55 |
+
- [mGPT-1.3B-ukranian](https://huggingface.co/ai-forever/mGPT-1.3B-ukranian)
|
56 |
+
- [mGPT-1.3B-uzbek](https://huggingface.co/ai-forever/mGPT-1.3B-uzbek)
|
57 |
+
- [mGPT-1.3B-yakut](https://huggingface.co/ai-forever/mGPT-1.3B-yakut)
|
58 |
|
59 |
+
## Feedback
|
60 |
|
61 |
+
If you'll found a bug of have additional data to train model on your language — please, give us feedback.
|
62 |
|
63 |
Model will be improved over time. Stay tuned!
|