---
language: multilingual
thumbnail:
tags:
- audio-classification
- speechbrain
- embeddings
- Language
- Identification
- pytorch
- wav2vec2.0
- XLS-R-300M
- VoxLingua107
license: "apache-2.0"
datasets:
- VoxLingua107
metrics:
- Accuracy
widget:
- example_title: English Sample
  src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac
---

# VoxLingua107 Wav2Vec Spoken Language Identification Model

## Model description

This is a spoken language identification model trained on the VoxLingua107 dataset using SpeechBrain.

The model is trained starting from the weights of the pretrained [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) model (wav2vec 2.0 architecture), using a negative log-likelihood loss.

The model can classify a speech utterance according to the language spoken.
It covers 107 different languages (Abkhazian, Afrikaans, Amharic, Arabic, Assamese, Azerbaijani, Bashkir, Belarusian, Bulgarian, Bengali, Tibetan, Breton, Bosnian, Catalan, Cebuano, Czech, Welsh, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, Faroese, French, Galician, Guarani, Gujarati, Manx, Hausa, Hawaiian, Hindi, Croatian, Haitian, Hungarian, Armenian, Interlingua, Indonesian, Icelandic, Italian, Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Latin, Luxembourgish, Lingala, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian Nynorsk, Norwegian, Occitan, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sanskrit, Scots, Sindhi, Sinhala, Slovak, Slovenian, Shona, Somali, Albanian, Serbian, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Turkmen, Tagalog, Turkish, Tatar, Ukrainian, Urdu, Uzbek, Vietnamese, Waray, Yiddish, Yoruba, Mandarin Chinese).

## Intended uses & limitations

The model has two uses:

- use 'as is' for spoken language recognition
- use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data (a sketch follows the usage example below)

The model is trained on automatically collected YouTube data. For more information about the dataset, see [here](http://bark.phon.ioc.ee/voxlingua107/).

#### How to use

```python
import torchaudio
from speechbrain.pretrained.interfaces import foreign_class

language_id = foreign_class(source="TalTechNLP/voxlingua107-xls-r-300m-wav2vec", pymodule_file="encoder_wav2vec_classifier.py", classname="EncoderWav2vecClassifier", hparams_file='inference_wav2vec.yaml', savedir="tmp")

# Download Thai language sample from Omniglot and convert to suitable form
wav_file = "https://omniglot.com/soundfiles/udhr/udhr_th.mp3"
out_prob, score, index, text_lab = language_id.classify_file(wav_file)

# The identified language ISO code is given in text_lab
print("probability:", out_prob)
print("label:", text_lab)
print("score:", score)
print("index:", index)
# Expected output:
# probability: tensor([[[-2.2849e+01, -2.4349e+01, -2.3686e+01, -2.3632e+01, -2.0218e+01,
#          -2.7241e+01, -2.6715e+01, -2.2301e+01, -2.6076e+01, -2.1716e+01,
#          -1.9923e+01, -2.7303e+01, -2.1211e+01, -2.2998e+01, -2.4436e+01,
#          -2.6437e+01, -2.2686e+01, -2.4244e+01, -2.0416e+01, -2.8329e+01,
#          -1.7788e+01, -2.4829e+01, -2.4186e+01, -2.7036e+01, -2.5993e+01,
#          -1.9677e+01, -2.2746e+01, -2.9192e+01, -2.4941e+01, -2.7135e+01,
#          -2.6653e+01, -2.2791e+01, -2.4599e+01, -2.1066e+01, -2.4855e+01,
#          -2.1874e+01, -2.2914e+01, -2.4174e+01, -2.0902e+01, -2.3197e+01,
#          -2.6108e+01, -2.3941e+01, -2.3103e+01, -2.2363e+01, -2.8969e+01,
#          -2.5302e+01, -2.4862e+01, -2.2392e+01, -2.4042e+01, -2.1221e+01,
#          -2.3656e+01, -2.1286e+01, -1.9209e+01, -2.3254e+01, -2.8291e+01,
#          -5.9105e+00, -2.4525e+01, -2.4937e+01, -2.8349e+01, -2.4420e+01,
#          -2.7439e+01, -2.6329e+01, -2.3317e+01, -2.3842e+01, -2.2114e+01,
#          -2.3637e+01, -1.7217e+01, -1.8342e+01, -2.4332e+01, -2.6090e+01,
#          -2.5452e+01, -2.3854e+01, -2.6082e+01, -2.4992e+01, -2.0618e+01,
#          -2.9351e+01, -2.4153e+01, -2.3156e+01, -2.6893e+01, -2.5314e+01,
#          -2.8374e+01, -2.4009e+01, -2.3604e+01, -2.4063e+01, -2.3538e+01,
#          -2.4953e+01, -2.5607e+01, -2.3960e+01, -2.6471e+01, -2.3348e+01,
#          -2.1681e+01, -2.7610e+01, -2.5023e+01, -2.3585e+01, -2.7146e-03,
#          -2.0338e+01, -1.8737e+01, -2.5158e+01, -2.7491e+01, -2.3623e+01,
#          -2.5718e+01, -2.3465e+01, -1.8305e+01, -2.1064e+01, -2.9880e+01,
#          -2.2809e+01, -1.9856e+01]]])
# label: [['th']]
# score: tensor([[-0.0027]])
# index: tensor([[94]])

# The scores in the out_prob tensor can be interpreted as log-likelihoods that
# the given utterance belongs to the given language (i.e., the larger the better).
# The linear-scale likelihood can be retrieved using the following:
print(score.exp())
# tensor([0.9973])
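
# Not in the original card: a hypothetical follow-up that ranks candidate
# languages with plain PyTorch. out_prob has shape [1, 1, 107], so topk over
# its last dimension returns the best-scoring language indices first.
top_scores, top_indices = out_prob[0][0].topk(5)
print(top_indices)       # indices into the 107-language inventory
print(top_scores.exp())  # their linear-scale likelihoods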

# Alternatively, use the utterance embedding extractor.
# torchaudio.load expects a local file, so point it at a downloaded copy of the sample:
signal, fs = torchaudio.load("udhr_th.mp3")
embeddings = language_id.encode_batch(signal)
print(embeddings.shape)
# torch.Size([2, 1, 2048])  (the first dimension follows the batch dimension of `signal`)
```
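
For the second use listed above (embedding extractor), here is a minimal sketch of training a dedicated language ID model on your own data. It assumes scikit-learn and a hypothetical `my_data` list of `(wav_path, label)` pairs that you supply yourself; none of this is part of the original card:

```python
# Hypothetical sketch: train a small classifier on utterance embeddings.
# `my_data` is an assumed list of (wav_path, language_label) pairs and
# scikit-learn is an assumed dependency; adapt both to your setup.
import numpy as np
import torch
import torchaudio
from sklearn.linear_model import LogisticRegression

def embed(path):
    signal, fs = torchaudio.load(path)
    with torch.no_grad():
        emb = language_id.encode_batch(signal)  # [batch, 1, 2048]
    return emb.mean(dim=(0, 1)).cpu().numpy()   # collapse to one 2048-dim vector

X = np.stack([embed(path) for path, _ in my_data])
y = [label for _, label in my_data]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:2]))  # sanity check on two training examples
```

Logistic regression is only a placeholder here; any classifier over the 2048-dimensional embeddings can be substituted.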

#### Limitations and bias

Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:

- Its accuracy on smaller languages is probably quite limited
- It probably works worse on female speech than on male speech, because the YouTube data includes much more male speech
- Based on our experiments, it performs satisfactorily on accented speech
- It probably doesn't work well on children's speech or on speakers with speech disorders

## Training data

The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).

VoxLingua107 is a speech dataset for training spoken language identification models.
The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.

VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
The average amount of data per language is 62 hours. However, the real amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.

## Training procedure

We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
The training recipe will be published soon.
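
In the meantime, here is a conceptual sketch of what such a classification head can look like. The statistics-pooling choice and the dimensions are assumptions inferred from the 2048-dimensional embeddings (twice the encoder's 1024-dimensional hidden size), not the published recipe:

```python
# Hypothetical sketch of a wav2vec 2.0 language-ID head, not the actual
# recipe: frame-level encoder outputs are pooled with mean + standard
# deviation (2 * 1024 = 2048 dims) and classified with an NLL loss.
import torch
import torch.nn as nn

class LangIdHead(nn.Module):
    def __init__(self, encoder_dim=1024, n_langs=107):
        super().__init__()
        self.classifier = nn.Linear(2 * encoder_dim, n_langs)

    def forward(self, feats):
        # feats: [batch, time, encoder_dim] outputs of the wav2vec encoder
        pooled = torch.cat([feats.mean(dim=1), feats.std(dim=1)], dim=1)
        return torch.log_softmax(self.classifier(pooled), dim=1)

head = LangIdHead()
log_probs = head(torch.randn(4, 250, 1024))  # dummy encoder outputs
loss = nn.NLLLoss()(log_probs, torch.tensor([0, 5, 94, 106]))
```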

## Evaluation results

| Version    | Error Rate (%) |
|------------|:--------------:|
| 2022-04-14 |      5.6       |

Error rate is calculated on the VoxLingua107 development dataset.
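
As a rough guide, an error rate like this can be reproduced with a loop over a local copy of the development data; `dev_wavs` and `dev_labels` below are hypothetical parallel lists of file paths and gold ISO codes:

```python
# Hypothetical evaluation loop; dev_wavs and dev_labels are assumed to be
# your local copy of the development set (audio paths and gold ISO codes).
n_err = 0
for wav, gold in zip(dev_wavs, dev_labels):
    out_prob, score, index, text_lab = language_id.classify_file(wav)
    n_err += int(text_lab[0][0] != gold)
print("error rate: %.1f%%" % (100 * n_err / len(dev_wavs)))
```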

### BibTeX entry and citation info

```bibtex
@inproceedings{valk2021slt,
  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
  booktitle={Proc. IEEE SLT Workshop},
  year={2021},
}
```