|
|
|
--- |
|
language: |
|
- multilingual |
|
- ar |
|
- as |
|
- br |
|
- ca |
|
- cnh |
|
- cs |
|
- cv |
|
- cy |
|
- de |
|
- dv |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- hi |
|
- hsb |
|
- hu |
|
- ia |
|
- id

- it
|
- ja |
|
- ka |
|
- ky |
|
- lg |
|
- lt |
|
- lv
|
- mn |
|
- mt |
|
- nl |
|
- or |
|
- pl |
|
- pt |
|
- ro |
|
- ru |
|
- sah |
|
- sl |
|
- ta |
|
- th |
|
- tr |
|
- tt |
|
- uk |
|
- vi |
|
license: apache-2.0 |
|
tags: |
|
- audio |
|
- automatic-speech-recognition |
|
- hf-asr-leaderboard |
|
- robust-speech-event |
|
- speech |
|
- xlsr-fine-tuning-week |
|
datasets: |
|
- common_voice |
|
language_bcp47: |
|
- fy-NL |
|
- ga-IE |
|
- pa-IN |
|
- rm-sursilv |
|
- rm-vallader |
|
- sv-SE
|
- zh-CN |
|
- zh-HK |
|
- zh-TW |
|
model-index: |
|
- name: XLSR Wav2Vec2 for 56 Languages by Voidful
|
results: |
|
- task: |
|
type: automatic-speech-recognition |
|
name: Speech Recognition |
|
dataset: |
|
name: Common Voice |
|
type: common_voice |
|
metrics: |
|
- type: cer |
|
value: 23.21 |
|
name: Test CER |
|
--- |
|
|
|
# Model Card for wav2vec2-xlsr-multilingual-56 |
|
|
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
- **Developed by:** voidful |
|
- **Shared by [Optional]:** Hugging Face |
|
- **Model type:** automatic-speech-recognition |
|
- **Language(s) (NLP):** multilingual (*56 languages, one multilingual ASR model*)
|
- **License:** Apache-2.0 |
|
- **Related Models:** |
|
- **Parent Model:** [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
|
- **Resources for more information:** |
|
- [GitHub Repo](https://github.com/voidful/wav2vec2-xlsr-multilingual-56) |
|
- [Model Space](https://huggingface.co/spaces/Kamtera/Persian_Automatic_Speech_Recognition_and-more) |
|
|
|
|
|
# Uses |
|
|
|
|
|
## Direct Use |
|
|
|
This model can be used for the task of automatic speech recognition.
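
For a quick check, the generic `transformers` ASR pipeline can load this checkpoint directly. A minimal sketch, with a placeholder file path, is shown below; note that the pipeline performs plain CTC decoding and does not apply the language-specific decoding shown in the full example at the end of this card.

```python
from transformers import pipeline

# Minimal sketch: plain CTC decoding through the generic ASR pipeline.
asr = pipeline("automatic-speech-recognition", model="voidful/wav2vec2-xlsr-multilingual-56")

# "sample.wav" is a placeholder; the input should be 16 kHz speech.
print(asr("sample.wav")["text"])
```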
|
|
|
## Downstream Use [Optional] |
|
|
|
More information needed |
|
|
|
## Out-of-Scope Use |
|
|
|
The model should not be used to intentionally create hostile or alienating environments for people. |
|
|
|
# Bias, Risks, and Limitations |
|
|
|
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. |
|
|
|
|
|
## Recommendations |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. |
|
|
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
See the [common_voice dataset card](https://huggingface.co/datasets/common_voice).
|
Fine-tuned from [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on 56 languages using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
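
As an illustration, a single Common Voice language split can be loaded with the `datasets` library. The sketch below assumes the legacy `common_voice` loading script (newer releases are published under `mozilla-foundation/common_voice_<version>`); Persian (`fa`) is chosen arbitrarily.

```python
from datasets import load_dataset

# Illustrative only: load the Persian test split of the legacy common_voice dataset.
common_voice_fa = load_dataset("common_voice", "fa", split="test")
print(common_voice_fa[0]["sentence"])
```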
|
|
|
## Training Procedure |
|
|
|
|
|
### Preprocessing |
|
|
|
More information needed |
|
|
|
### Speeds, Sizes, Times |
|
|
|
|
|
When using this model, make sure that your speech input is sampled at 16 kHz.
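
The sketch below shows one way to resample arbitrary audio to 16 kHz with `torchaudio` before feeding it to the model (the file name is a placeholder):

```python
import torchaudio

# "sample.wav" is a placeholder path.
speech, orig_sr = torchaudio.load("sample.wav")
if orig_sr != 16_000:
    # Resample to the 16 kHz rate the model expects.
    speech = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=16_000)(speech)
```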
|
|
|
|
|
# Evaluation |
|
|
|
|
|
## Testing Data, Factors & Metrics |
|
|
|
### Testing Data |
|
|
|
More information needed |
|
|
|
### Factors |
|
|
|
|
|
### Metrics |
|
|
|
The per-language results below report word error rate (WER) and character error rate (CER) on Common Voice test data, expressed as percentages.
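
A minimal sketch of computing both metrics, assuming the `jiwer` package (not part of this card's setup); `jiwer` returns rates in [0, 1], so multiply by 100 to match the table:

```python
import jiwer

reference = "die katze sitzt auf der matte"
hypothesis = "die katze sizt auf der mate"

# Word and character error rates as percentages.
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}")
print(f"CER: {jiwer.cer(reference, hypothesis) * 100:.2f}")
```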
|
## Results |
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
| Common Voice Language | Num. of Samples | Hours | WER (%) | CER (%) |
|
|------------------------|--------------|--------|--------|-------| |
|
| ar | 21744 | 81.5 | 75.29 | 31.23 | |
|
| as | 394 | 1.1 | 95.37 | 46.05 | |
|
| br | 4777 | 7.4 | 93.79 | 41.16 | |
|
| ca | 301308 | 692.8 | 24.80 | 10.39 | |
|
| cnh | 1563 | 2.4 | 68.11 | 23.10 | |
|
| cs | 9773 | 39.5 | 67.86 | 12.57 | |
|
| cv | 1749 | 5.9 | 95.43 | 34.03 | |
|
| cy | 11615 | 106.7 | 67.03 | 23.97 | |
|
| de | 262113 | 822.8 | 27.03 | 6.50 | |
|
| dv | 4757 | 18.6 | 92.16 | 30.15 | |
|
| el | 3717 | 11.1 | 94.48 | 58.67 | |
|
| en | 580501 | 1763.6 | 34.87 | 14.84 | |
|
| eo | 28574 | 162.3 | 37.77 | 6.23 | |
|
| es | 176902 | 337.7 | 19.63 | 5.41 | |
|
| et | 5473 | 35.9 | 86.87 | 20.79 | |
|
| eu | 12677 | 90.2 | 44.80 | 7.32 | |
|
| fa | 12806 | 290.6 | 53.81 | 15.09 | |
|
| fi | 875 | 2.6 | 93.78 | 27.57 | |
|
| fr | 314745 | 664.1 | 33.16 | 13.94 | |
|
| fy-NL | 6717 | 27.2 | 72.54 | 26.58 | |
|
| ga-IE | 1038 | 3.5 | 92.57 | 51.02 | |
|
| hi | 292 | 2.0 | 90.95 | 57.43 | |
|
| hsb | 980 | 2.3 | 89.44 | 27.19 | |
|
| hu | 4782 | 9.3 | 97.15 | 36.75 | |
|
| ia | 5078 | 10.4 | 52.00 | 11.35 | |
|
| id | 3965 | 9.9 | 82.50 | 22.82 | |
|
| it | 70943 | 178.0 | 39.09 | 8.72 | |
|
| ja | 1308 | 8.2 | 99.21 | 62.06 | |
|
| ka | 1585 | 4.0 | 90.53 | 18.57 | |
|
| ky | 3466 | 12.2 | 76.53 | 19.80 | |
|
| lg | 1634 | 17.1 | 98.95 | 43.84 | |
|
| lt | 1175 | 3.9 | 92.61 | 26.81 | |
|
| lv | 4554 | 6.3 | 90.34 | 30.81 | |
|
| mn | 4020 | 11.6 | 82.68 | 30.14 | |
|
| mt | 3552 | 7.8 | 84.18 | 22.96 | |
|
| nl | 14398 | 71.8 | 57.18 | 19.01 | |
|
| or | 517 | 0.9 | 90.93 | 27.34 | |
|
| pa-IN | 255 | 0.8 | 87.95 | 42.03 | |
|
| pl | 12621 | 112.0 | 56.14 | 12.06 | |
|
| pt | 11106 | 61.3 | 53.24 | 16.32 | |
|
| rm-sursilv | 2589 | 5.9 | 78.17 | 23.31 | |
|
| rm-vallader | 931 | 2.3 | 73.67 | 21.76 | |
|
| ro | 4257 | 8.7 | 83.84 | 21.95 | |
|
| ru | 23444 | 119.1 | 61.83 | 15.18 | |
|
| sah | 1847 | 4.4 | 94.38 | 38.46 | |
|
| sl | 2594 | 6.7 | 84.21 | 20.54 | |
|
| sv-SE | 4350 | 20.8 | 83.68 | 30.79 | |
|
| ta | 3788 | 18.4 | 84.19 | 21.60 | |
|
| th | 4839 | 11.7 | 141.87 | 37.16 | |
|
| tr | 3478 | 22.3 | 66.77 | 15.55 | |
|
| tt | 13338 | 26.7 | 86.80 | 33.57 | |
|
| uk | 7271 | 39.4 | 70.23 | 14.34 | |
|
| vi | 421 | 1.7 | 96.06 | 66.25 | |
|
| zh-CN | 27284 | 58.7 | 89.67 | 23.96 | |
|
| zh-HK | 12678 | 92.1 | 81.77 | 18.82 | |
|
| zh-TW | 6402 | 56.6 | 85.08 | 29.07 | |
|
|
|
</details> |
|
# Model Examination |
|
|
|
More information needed |
|
|
|
# Environmental Impact |
|
|
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** More information needed |
|
- **Hours used:** More information needed |
|
- **Cloud Provider:** More information needed |
|
- **Compute Region:** More information needed |
|
- **Carbon Emitted:** More information needed |
|
|
|
# Technical Specifications [optional] |
|
|
|
## Model Architecture and Objective |
|
|
|
More information needed |
|
|
|
## Compute Infrastructure |
|
|
|
More information needed |
|
|
|
### Hardware |
|
|
|
More information needed |
|
|
|
### Software |
|
More information needed |
|
|
|
# Citation |
|
|
|
|
|
**BibTeX:** |
|
``` |
|
More information needed |
|
``` |
|
|
|
**APA:** |
|
``` |
|
More information needed |
|
``` |
|
|
|
# Glossary [optional] |
|
More information needed |
|
|
|
# More Information [optional] |
|
|
|
More information needed |
|
|
|
# Model Card Authors [optional] |
|
|
|
voidful in collaboration with Ezi Ozoani and the Hugging Face team |
|
|
|
# Model Card Contact |
|
|
|
More information needed |
|
|
|
# How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
|
|
## Environment setup
|
```bash
|
!pip install torchaudio |
|
!pip install datasets transformers |
|
!pip install asrp |
|
!wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk |
|
``` |
|
|
|
## Usage |
|
|
|
```python
|
import torchaudio |
|
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

import torch
|
model_name = "voidful/wav2vec2-xlsr-multilingual-56" |
|
device = "cuda" if torch.cuda.is_available() else "cpu"
|
processor_name = "voidful/wav2vec2-xlsr-multilingual-56" |
|
|
|
import pickle |
|
# lang_ids.pk maps each language code to the token ids that belong to that language's vocabulary.

with open("lang_ids.pk", "rb") as f:

    lang_ids = pickle.load(f)
|
|
|
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device) |
|
processor = Wav2Vec2Processor.from_pretrained(processor_name) |
|
|
|
model.eval() |
|
|
|
def load_file_to_data(file, sampling_rate=16_000):

    # sampling_rate is the native rate of the input file; audio is resampled to the 16 kHz rate the model expects.

    batch = {}

    speech, _ = torchaudio.load(file)

    if sampling_rate != 16_000:

        resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)

        batch["speech"] = resampler(speech.squeeze(0)).numpy()

        batch["sampling_rate"] = resampler.new_freq

    else:

        batch["speech"] = speech.squeeze(0).numpy()

        batch["sampling_rate"] = 16_000

    return batch
|
|
|
|
|
def predict(data): |
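    # Greedy CTC decoding over the full multilingual vocabulary: drop frames predicted as the
    # padding/blank token (id 0), renormalize the remaining logits with softmax, and decode the per-frame argmax.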
|
features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt") |
|
input_values = features.input_values.to(device) |
|
attention_mask = features.attention_mask.to(device) |
|
with torch.no_grad(): |
|
logits = model(input_values, attention_mask=attention_mask).logits |
|
decoded_results = [] |
|
for logit in logits: |
|
pred_ids = torch.argmax(logit, dim=-1) |
|
mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size()) |
|
vocab_size = logit.size()[-1] |
|
voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1) |
|
comb_pred_ids = torch.argmax(voice_prob, dim=-1) |
|
decoded_results.append(processor.decode(comb_pred_ids)) |
|
|
|
return decoded_results |
|
|
|
def predict_lang_specific(data,lang_code): |
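    # Same decoding as predict(), but the per-frame probabilities are masked with lang_ids[lang_code]
    # so that only token ids belonging to the requested language can be selected.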
|
features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt") |
|
input_values = features.input_values.to(device) |
|
attention_mask = features.attention_mask.to(device) |
|
with torch.no_grad(): |
|
logits = model(input_values, attention_mask=attention_mask).logits |
|
decoded_results = [] |
|
for logit in logits: |
|
pred_ids = torch.argmax(logit, dim=-1) |
|
mask = ~pred_ids.eq(processor.tokenizer.pad_token_id).unsqueeze(-1).expand(logit.size()) |
|
vocab_size = logit.size()[-1] |
|
voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1) |
|
filtered_input = pred_ids[pred_ids!=processor.tokenizer.pad_token_id].view(1,-1).to(device) |
|
if len(filtered_input[0]) == 0: |
|
decoded_results.append("") |
|
else: |
|
lang_mask = torch.empty(voice_prob.shape[-1]).fill_(0) |
|
lang_index = torch.tensor(sorted(lang_ids[lang_code])) |
|
lang_mask.index_fill_(0, lang_index, 1) |
|
lang_mask = lang_mask.to(device) |
|
comb_pred_ids = torch.argmax(lang_mask*voice_prob, dim=-1) |
|
decoded_results.append(processor.decode(comb_pred_ids)) |
|
|
|
return decoded_results |
|
|
|
|
|
predict(load_file_to_data('audio file path', sampling_rate=16_000))  # pass the file's native sampling rate

predict_lang_specific(load_file_to_data('audio file path', sampling_rate=16_000), 'en')  # restrict decoding to English tokens
|
|
|
``` |
|
|
|
|
</details> |
|
|
|
|