---
language: multilingual
thumbnail:
tags:
- audio-classification
- speechbrain
- embeddings
- Language
- Identification
- pytorch
- wav2vec2.0
- XLS-R-300M
- VoxLingua107
license: "apache-2.0"
datasets:
- VoxLingua107
metrics:
- Accuracy
widget:
- example_title: English Sample
  src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac
---
# VoxLingua107 Wav2Vec Spoken Language Identification Model

## Model description

This is a spoken language identification model trained on the VoxLingua107 dataset using SpeechBrain.

The model was initialized with the weights of the pretrained [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) model (a Wav2Vec 2.0 architecture) and trained with a negative log-likelihood loss.

The model can classify a speech utterance according to the language spoken.
It covers 107 different languages (Abkhazian, Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Tibetan, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, Faroese, French, Galician, Guarani, Gujarati, Manx, Hausa, Hawaiian, Hindi, Croatian, Haitian, Hungarian, Armenian, Interlingua, Indonesian, Icelandic, Italian, Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Latin, Luxembourgish, Lingala, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian Nynorsk, Norwegian, Occitan, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sanskrit, Scots, Sindhi, Sinhala, Slovak, Slovenian, Shona, Somali, Albanian, Serbian, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Turkmen, Tagalog, Turkish, Tatar, Ukrainian, Urdu, Uzbek, Vietnamese, Waray, Yiddish, Yoruba, Mandarin Chinese).

## Intended uses & limitations

The model has two uses:

- use it 'as is' for spoken language recognition
- use it as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data (see the sketch below)
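
For the second use, here is a minimal sketch of what a dedicated classifier on top of the extracted embeddings could look like. The 2048-dimensional embedding size matches the `encode_batch` output shown under "How to use" below; the number of target languages and the training-loop names are placeholders for your own data, not part of this model:

```python
import torch
import torch.nn as nn

# Hypothetical downstream head: a linear classifier over the 2048-dim
# utterance embeddings produced by language_id.encode_batch (see below).
NUM_OWN_LANGUAGES = 5  # placeholder: number of languages in your own data
head = nn.Linear(2048, NUM_OWN_LANGUAGES)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(embeddings: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of (N, 2048) embeddings and (N,) labels."""
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```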

The model is trained on automatically collected YouTube data. For more
information about the dataset, see [here](http://bark.phon.ioc.ee/voxlingua107/).

#### How to use

```python
import torchaudio
from speechbrain.pretrained.interfaces import foreign_class

language_id = foreign_class(
    source="TalTechNLP/voxlingua107-xls-r-300m-wav2vec",
    pymodule_file="encoder_wav2vec_classifier.py",
    classname="EncoderWav2vecClassifier",
    hparams_file="inference_wav2vec.yaml",
    savedir="tmp",
)

# Download a Thai language sample from Omniglot and classify it
wav_file = "https://omniglot.com/soundfiles/udhr/udhr_th.mp3"
out_prob, score, index, text_lab = language_id.classify_file(wav_file)

print("probability:", out_prob)
# probability: tensor([[[-2.2849e+01, -2.4349e+01, -2.3686e+01, ...,
#                        -2.2809e+01, -1.9856e+01]]])  (107 log-likelihoods)
print("label:", text_lab)
# label: [['th']]
print("score:", score)
# score: tensor([[-0.0027]])
print("index:", index)
# index: tensor([[94]])
# The identified language's ISO code is given in text_lab[0][0]

# The scores in the out_prob tensor can be interpreted as log-likelihoods that
# the given utterance belongs to the given language (i.e., the larger the better).
# The linear-scale likelihood can be retrieved as follows:
print(score.exp())
# tensor([0.9973])

# Alternatively, use the utterance embedding extractor. Note that
# torchaudio.load needs a local file, so download the sample first
# (the local filename below is just an example):
signal, fs = torchaudio.load("udhr_th.mp3")
embeddings = language_id.encode_batch(signal)
print(embeddings.shape)
# torch.Size([2, 1, 2048])
```
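
Beyond the single best label returned by `classify_file`, the full `out_prob` tensor can be used to rank candidate languages. Here is a minimal sketch, assuming the loaded classifier exposes its label encoder as `language_id.hparams.label_encoder` (as SpeechBrain's stock `EncoderClassifier` does; this custom interface may name it differently):

```python
import torch

# Rank the five most likely languages for the utterance classified above.
top_logp, top_idx = torch.topk(out_prob.squeeze(), k=5)

# Assumption: the label encoder is reachable via hparams.label_encoder.
top_langs = language_id.hparams.label_encoder.decode_ndim(top_idx)
for lang, logp in zip(top_langs, top_logp):
    print(f"{lang}: {logp.exp().item():.4f}")  # linear-scale likelihood
```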

#### Limitations and bias

Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:

- Its accuracy on smaller languages is probably quite limited
- It probably works worse on female speech than on male speech, because the YouTube data contains much more male speech
- Based on our experiments, it performs satisfactorily on accented speech
- It probably doesn't work well on children's speech or on speech from persons with speech disorders

## Training data

The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).

VoxLingua107 is a speech dataset for training spoken language identification models.
The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.

VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
The average amount of data per language is 62 hours. However, the actual amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.

## Training procedure

We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
The training recipe will be published soon.

## Evaluation results

| Version    | Error rate (%) |
|------------|:--------------:|
| 2022-04-14 | 5.6            |

The error rate is calculated on the VoxLingua107 development set.
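
For reference, here is a minimal sketch of how such an identification error rate can be computed from parallel lists of predicted and reference language codes (the function and its inputs are illustrative, not part of the released evaluation code):

```python
def error_rate(predicted, reference):
    """Percentage of utterances whose predicted language differs from the reference."""
    assert len(predicted) == len(reference)
    errors = sum(p != r for p, r in zip(predicted, reference))
    return 100.0 * errors / len(reference)

# error_rate(["th", "en", "et"], ["th", "et", "et"]) -> 33.33...
```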

### BibTeX entry and citation info

```bibtex
@inproceedings{valk2021slt,
  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
  booktitle={Proc. IEEE SLT Workshop},
  year={2021},
}
```