|
|
|
--- |
|
language: |
|
- multilingual |
|
- ar |
|
- as |
|
- br |
|
- ca |
|
- cnh |
|
- cs |
|
- cv |
|
- cy |
|
- de |
|
- dv |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- hi |
|
- hsb |
|
- hu |
|
- ia |
|
- id

- it
|
- ja |
|
- ka |
|
- ky |
|
- lg |
|
- lt |
|
- lv
|
- mn |
|
- mt |
|
- nl |
|
- or |
|
- pl |
|
- pt |
|
- ro |
|
- ru |
|
- sah |
|
- sl |
|
- ta |
|
- th |
|
- tr |
|
- tt |
|
- uk |
|
- vi |
|
license: apache-2.0 |
|
tags: |
|
- audio |
|
- automatic-speech-recognition |
|
- hf-asr-leaderboard |
|
- robust-speech-event |
|
- speech |
|
- xlsr-fine-tuning-week |
|
datasets: |
|
- common_voice |
|
language_bcp47: |
|
- fy-NL |
|
- ga-IE |
|
- pa-IN |
|
- rm-sursilv |
|
- rm-vallader |
|
- sv-SE
|
- zh-CN |
|
- zh-HK |
|
- zh-TW |
|
model-index: |
|
- name: XLSR Wav2Vec2 for 56 Languages by Voidful
|
results: |
|
- task: |
|
type: automatic-speech-recognition |
|
name: Speech Recognition |
|
dataset: |
|
name: Common Voice |
|
type: common_voice |
|
metrics: |
|
- type: cer |
|
value: 23.21 |
|
name: Test CER |
|
--- |
|
|
|
# Model Card for wav2vec2-xlsr-multilingual-56 |
|
|
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
- **Developed by:** voidful |
|
- **Shared by [Optional]:** Hugging Face |
|
- **Model type:** automatic-speech-recognition |
|
- **Language(s) (NLP):** multilingual (*56 languages, one multilingual ASR model*)
|
- **License:** Apache-2.0 |
|
- **Related Models:** |
|
- **Parent Model:** [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
|
- **Resources for more information:** |
|
- [GitHub Repo](https://github.com/voidful/wav2vec2-xlsr-multilingual-56) |
|
- [Model Space](https://huggingface.co/spaces/Kamtera/Persian_Automatic_Speech_Recognition_and-more) |
|
|
|
|
|
# Uses |
|
|
|
|
|
## Direct Use |
|
|
|
This model can be used for the task of automatic speech recognition.
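
For a quick check, the generic `transformers` ASR pipeline can load this checkpoint directly. A minimal sketch, with a placeholder file path, is shown below; note that the pipeline performs plain CTC decoding and does not apply the language-specific decoding shown in the full example at the end of this card.

```python
from transformers import pipeline

# Minimal sketch: plain CTC decoding through the generic ASR pipeline.
asr = pipeline("automatic-speech-recognition", model="voidful/wav2vec2-xlsr-multilingual-56")

# "sample.wav" is a placeholder; the input should be 16 kHz speech.
print(asr("sample.wav")["text"])
```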
|
|
|
## Downstream Use [Optional] |
|
|
|
More information needed |
|
|
|
## Out-of-Scope Use |
|
|
|
The model should not be used to intentionally create hostile or alienating environments for people. |
|
|
|
# Bias, Risks, and Limitations |
|
|
|
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. |
|
|
|
|
|
## Recommendations |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. |
|
|
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
See the [common_voice dataset card](https://huggingface.co/datasets/common_voice).
|
Fine-tuned from [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on 56 languages using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
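
As an illustration, a single Common Voice language split can be loaded with the `datasets` library. The sketch below assumes the legacy `common_voice` loading script (newer releases are published under `mozilla-foundation/common_voice_<version>`); Persian (`fa`) is chosen arbitrarily.

```python
from datasets import load_dataset

# Illustrative only: load the Persian test split of the legacy common_voice dataset.
common_voice_fa = load_dataset("common_voice", "fa", split="test")
print(common_voice_fa[0]["sentence"])
```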
|
|
|
## Training Procedure |
|
|
|
|
|
### Preprocessing |
|
|
|
More information needed |
|
|
|
### Speeds, Sizes, Times |
|
|
|
|
|
When using this model, make sure that your speech input is sampled at 16 kHz.
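
The sketch below shows one way to resample arbitrary audio to 16 kHz with `torchaudio` before feeding it to the model (the file name is a placeholder):

```python
import torchaudio

# "sample.wav" is a placeholder path.
speech, orig_sr = torchaudio.load("sample.wav")
if orig_sr != 16_000:
    # Resample to the 16 kHz rate the model expects.
    speech = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=16_000)(speech)
```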
|
|
|
|
|
# Evaluation |
|
|
|
|
|
## Testing Data, Factors & Metrics |
|
|
|
### Testing Data |
|
|
|
More information needed |
|
|
|
### Factors |
|
|
|
|
|
### Metrics |
|
|
|
The per-language results below report word error rate (WER) and character error rate (CER) on Common Voice test data, expressed as percentages.
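
A minimal sketch of computing both metrics, assuming the `jiwer` package (not part of this card's setup); `jiwer` returns rates in [0, 1], so multiply by 100 to match the table:

```python
import jiwer

reference = "die katze sitzt auf der matte"
hypothesis = "die katze sizt auf der mate"

# Word and character error rates as percentages.
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}")
print(f"CER: {jiwer.cer(reference, hypothesis) * 100:.2f}")
```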
|
## Results |
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
| Common Voice Language | Num. of Samples | Hours | WER (%) | CER (%) |
|
|------------------------|--------------|--------|--------|-------| |
|
| ar | 21744 | 81.5 | 75.29 | 31.23 | |
|
| as | 394 | 1.1 | 95.37 | 46.05 | |
|
| br | 4777 | 7.4 | 93.79 | 41.16 | |
|
| ca | 301308 | 692.8 | 24.80 | 10.39 | |
|
| cnh | 1563 | 2.4 | 68.11 | 23.10 | |
|
| cs | 9773 | 39.5 | 67.86 | 12.57 | |
|
| cv | 1749 | 5.9 | 95.43 | 34.03 | |
|
| cy | 11615 | 106.7 | 67.03 | 23.97 | |
|
| de | 262113 | 822.8 | 27.03 | 6.50 | |
|
| dv | 4757 | 18.6 | 92.16 | 30.15 | |
|
| el | 3717 | 11.1 | 94.48 | 58.67 | |
|
| en | 580501 | 1763.6 | 34.87 | 14.84 | |
|
| eo | 28574 | 162.3 | 37.77 | 6.23 | |
|
| es | 176902 | 337.7 | 19.63 | 5.41 | |
|
| et | 5473 | 35.9 | 86.87 | 20.79 | |
|
| eu | 12677 | 90.2 | 44.80 | 7.32 | |
|
| fa | 12806 | 290.6 | 53.81 | 15.09 | |
|
| fi | 875 | 2.6 | 93.78 | 27.57 | |
|
| fr | 314745 | 664.1 | 33.16 | 13.94 | |
|
| fy-NL | 6717 | 27.2 | 72.54 | 26.58 | |
|
| ga-IE | 1038 | 3.5 | 92.57 | 51.02 | |
|
| hi | 292 | 2.0 | 90.95 | 57.43 | |
|
| hsb | 980 | 2.3 | 89.44 | 27.19 | |
|
| hu | 4782 | 9.3 | 97.15 | 36.75 | |
|
| ia | 5078 | 10.4 | 52.00 | 11.35 | |
|
| id | 3965 | 9.9 | 82.50 | 22.82 | |
|
| it | 70943 | 178.0 | 39.09 | 8.72 | |
|
| ja | 1308 | 8.2 | 99.21 | 62.06 | |
|
| ka | 1585 | 4.0 | 90.53 | 18.57 | |
|
| ky | 3466 | 12.2 | 76.53 | 19.80 | |
|
| lg | 1634 | 17.1 | 98.95 | 43.84 | |
|
| lt | 1175 | 3.9 | 92.61 | 26.81 | |
|
| lv | 4554 | 6.3 | 90.34 | 30.81 | |
|
| mn | 4020 | 11.6 | 82.68 | 30.14 | |
|
| mt | 3552 | 7.8 | 84.18 | 22.96 | |
|
| nl | 14398 | 71.8 | 57.18 | 19.01 | |
|
| or | 517 | 0.9 | 90.93 | 27.34 | |
|
| pa-IN | 255 | 0.8 | 87.95 | 42.03 | |
|
| pl | 12621 | 112.0 | 56.14 | 12.06 | |
|
| pt | 11106 | 61.3 | 53.24 | 16.32 | |
|
| rm-sursilv | 2589 | 5.9 | 78.17 | 23.31 | |
|
| rm-vallader | 931 | 2.3 | 73.67 | 21.76 | |
|
| ro | 4257 | 8.7 | 83.84 | 21.95 | |
|
| ru | 23444 | 119.1 | 61.83 | 15.18 | |
|
| sah | 1847 | 4.4 | 94.38 | 38.46 | |
|
| sl | 2594 | 6.7 | 84.21 | 20.54 | |
|
| sv-SE | 4350 | 20.8 | 83.68 | 30.79 | |
|
| ta | 3788 | 18.4 | 84.19 | 21.60 | |
|
| th | 4839 | 11.7 | 141.87 | 37.16 | |
|
| tr | 3478 | 22.3 | 66.77 | 15.55 | |
|
| tt | 13338 | 26.7 | 86.80 | 33.57 | |
|
| uk | 7271 | 39.4 | 70.23 | 14.34 | |
|
| vi | 421 | 1.7 | 96.06 | 66.25 | |
|
| zh-CN | 27284 | 58.7 | 89.67 | 23.96 | |
|
| zh-HK | 12678 | 92.1 | 81.77 | 18.82 | |
|
| zh-TW | 6402 | 56.6 | 85.08 | 29.07 | |
|
|
|
</details> |
|
# Model Examination |
|
|
|
More information needed |
|
|
|
# Environmental Impact |
|
|
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** More information needed |
|
- **Hours used:** More information needed |
|
- **Cloud Provider:** More information needed |
|
- **Compute Region:** More information needed |
|
- **Carbon Emitted:** More information needed |
|
|
|
# Technical Specifications [optional] |
|
|
|
## Model Architecture and Objective |
|
|
|
More information needed |
|
|
|
## Compute Infrastructure |
|
|
|
More information needed |
|
|
|
### Hardware |
|
|
|
More information needed |
|
|
|
### Software |
|
More information needed |
|
|
|
# Citation |
|
|
|
|
|
**BibTeX:** |
|
``` |
|
More information needed |
|
``` |
|
|
|
**APA:** |
|
``` |
|
More information needed |
|
``` |
|
|
|
# Glossary [optional] |
|
More information needed |
|
|
|
# More Information [optional] |
|
|
|
More information needed |
|
|
|
# Model Card Authors [optional] |
|
|
|
voidful in collaboration with Ezi Ozoani and the Hugging Face team |
|
|
|
# Model Card Contact |
|
|
|
More information needed |
|
|
|
# How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
|
|
## Environment setup
|
```bash
|
!pip install torchaudio |
|
!pip install datasets transformers |
|
!pip install asrp |
|
!wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk |
|
``` |
|
|
|
## Usage |
|
|
|
```python
|
import torchaudio |
|
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

import torch
|
model_name = "voidful/wav2vec2-xlsr-multilingual-56" |
|
device = "cuda" if torch.cuda.is_available() else "cpu"
|
processor_name = "voidful/wav2vec2-xlsr-multilingual-56" |
|
|
|
import pickle |
|
# lang_ids.pk maps each language code to the token ids that belong to that language's vocabulary.

with open("lang_ids.pk", "rb") as f:

    lang_ids = pickle.load(f)
|
|
|
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device) |
|
processor = Wav2Vec2Processor.from_pretrained(processor_name) |
|
|
|
model.eval() |
|
|
|
def load_file_to_data(file, sampling_rate=16_000):

    # sampling_rate is the native rate of the input file; audio is resampled to the 16 kHz rate the model expects.

    batch = {}

    speech, _ = torchaudio.load(file)

    if sampling_rate != 16_000:

        resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)

        batch["speech"] = resampler(speech.squeeze(0)).numpy()

        batch["sampling_rate"] = resampler.new_freq

    else:

        batch["speech"] = speech.squeeze(0).numpy()

        batch["sampling_rate"] = 16_000

    return batch
|
|
|
|
|
def predict(data): |
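    # Greedy CTC decoding over the full multilingual vocabulary: drop frames predicted as the
    # padding/blank token (id 0), renormalize the remaining logits with softmax, and decode the per-frame argmax.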
|
features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt") |
|
input_values = features.input_values.to(device) |
|
attention_mask = features.attention_mask.to(device) |
|
with torch.no_grad(): |
|
logits = model(input_values, attention_mask=attention_mask).logits |
|
decoded_results = [] |
|
for logit in logits: |
|
pred_ids = torch.argmax(logit, dim=-1) |
|
mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size()) |
|
vocab_size = logit.size()[-1] |
|
voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1) |
|
comb_pred_ids = torch.argmax(voice_prob, dim=-1) |
|
decoded_results.append(processor.decode(comb_pred_ids)) |
|
|
|
return decoded_results |
|
|
|
def predict_lang_specific(data,lang_code): |
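    # Same decoding as predict(), but the per-frame probabilities are masked with lang_ids[lang_code]
    # so that only token ids belonging to the requested language can be selected.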
|
features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt") |
|
input_values = features.input_values.to(device) |
|
attention_mask = features.attention_mask.to(device) |
|
with torch.no_grad(): |
|
logits = model(input_values, attention_mask=attention_mask).logits |
|
decoded_results = [] |
|
for logit in logits: |
|
pred_ids = torch.argmax(logit, dim=-1) |
|
mask = ~pred_ids.eq(processor.tokenizer.pad_token_id).unsqueeze(-1).expand(logit.size()) |
|
vocab_size = logit.size()[-1] |
|
voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1) |
|
filtered_input = pred_ids[pred_ids!=processor.tokenizer.pad_token_id].view(1,-1).to(device) |
|
if len(filtered_input[0]) == 0: |
|
decoded_results.append("") |
|
else: |
|
lang_mask = torch.empty(voice_prob.shape[-1]).fill_(0) |
|
lang_index = torch.tensor(sorted(lang_ids[lang_code])) |
|
lang_mask.index_fill_(0, lang_index, 1) |
|
lang_mask = lang_mask.to(device) |
|
comb_pred_ids = torch.argmax(lang_mask*voice_prob, dim=-1) |
|
decoded_results.append(processor.decode(comb_pred_ids)) |
|
|
|
return decoded_results |
|
|
|
|
|
predict(load_file_to_data('audio file path', sampling_rate=16_000))  # pass the file's native sampling rate

predict_lang_specific(load_file_to_data('audio file path', sampling_rate=16_000), 'en')  # restrict decoding to English tokens
|
|
|
``` |
|
|
|
|
</details> |
|
|
|
|