	---
	language:
	- da
	datasets:
	- common-voice-9
	- nst
	tags:
	- speech-to-text
	- hf-asr-leaderboard
	license: apache-2.0
	model-index:
	- name: xls-r-300m-nst-cv9-da
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 9.0 (Danish)
	      type: mozilla-foundation/common_voice_9_0
	      config: default
	      split: test
	      args:
	        language: da
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 10.8
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Alvenir ASR da eval
	      type: Alvenir/alvenir_asr_da_eval
	      config: default
	      split: test
	      args:
	        language: da
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 8.2
	---

	# xls-r-300m-danish-nst-cv9

	This is a version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) fine-tuned for Danish ASR on the training set of the public NST dataset and the Danish part of Common Voice 9. The model was trained on audio sampled at 16 kHz, so make sure your input uses the same sample rate.
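
If your recordings use a different sample rate, they need to be resampled before they reach the model. The sketch below is a minimal illustration using plain linear interpolation with NumPy; the helper `resample_linear` and the synthetic tone are assumptions made for this example and are not part of the model or its training pipeline. Linear interpolation applies no anti-aliasing filter, so for real audio a dedicated resampler (for example `torchaudio.functional.resample`, or casting a dataset column with `datasets.Audio(sampling_rate=16_000)`) is likely the better choice.

```python
import numpy as np

TARGET_SR = 16_000  # sample rate the model expects, per this model card


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample a mono waveform by linear interpolation (illustrative only)."""
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    # Time stamps (in seconds) of the original and the resampled samples.
    t_orig = np.arange(len(audio)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, audio).astype(np.float32)


# One second of a 440 Hz tone at 48 kHz, standing in for a real recording.
sr_in = 48_000
tone = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)
tone_16k = resample_linear(tone, sr_in)
print(len(tone), "->", len(tone_16k))  # 48000 -> 16000
```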

	The model was trained using fairseq with [this config](https://github.com/centre-for-humanities-computing/Gjallarhorn/blob/main/fairseq_configs/finetuning/xlrs_finetune.yaml) for 120,000 steps.


	## Usage
	```Python
	import torch
	from datasets import load_dataset
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	# load model and processor (feature extractor + tokenizer)
	processor = Wav2Vec2Processor.from_pretrained(
	    "chcaa/xls-r-300m-nst-cv9-da")
	model = Wav2Vec2ForCTC.from_pretrained(
	    "chcaa/xls-r-300m-nst-cv9-da")

	# load dataset and read soundfiles
	ds = load_dataset("Alvenir/alvenir_asr_da_eval", split="test")
	sample = ds[0]["audio"]

	# extract input features; passing the sampling rate lets the processor
	# reject audio that does not match the rate the model expects
	input_values = processor(
	    sample["array"],
	    sampling_rate=sample["sampling_rate"],
	    return_tensors="pt",
	    padding="longest",
	).input_values  # Batch size 1

	# retrieve logits (no gradients needed for inference)
	with torch.no_grad():
	    logits = model(input_values).logits

	# take argmax and decode
	predicted_ids = torch.argmax(logits, dim=-1)
	transcription = processor.batch_decode(predicted_ids)
	print(transcription)
	```
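
The "take argmax and decode" step above is greedy CTC decoding. As a rough sketch of what the decode call does conceptually: repeated frame predictions are collapsed, the blank token is removed, and the word delimiter becomes a space. The function `ctc_greedy_decode` and the toy vocabulary are hypothetical and written only for illustration; the sketch assumes the common Wav2Vec2 convention in which the padding token doubles as the CTC blank and `|` marks word boundaries. The real vocabulary ships with the processor.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join tokens."""
    # 1. Merge runs of identical ids (one symbol held over several frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the CTC blank, which separates genuinely repeated letters.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Turn the word delimiter into a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the model's actual one.
vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j", 5: "d", 6: "a"}
frames = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 0, 6, 6, 0]
print(ctc_greedy_decode(frames, vocab))  # hej da
```

Note that a blank between two identical ids keeps both letters: `[2, 0, 2]` decodes to `hh`, whereas `[2, 2]` decodes to a single `h`.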

	## Performance
	The table below shows the word error rate (WER) of four Danish ASR models on three publicly available datasets (lower is better). WER is reported as a fraction, so 0.108 corresponds to 10.8%.

	| Model | [Alvenir](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/) | [CV9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0) |
	|:--------------------------------------|------:|-----:|-----:|
	| [Alvenir/wav2vec2-base-da-ft-nst](https://huggingface.co/Alvenir/wav2vec2-base-da-ft-nst) | 0.202 | 0.099 | 0.238 |
	| [chcaa/alvenir-wav2vec2-base-da-nst-cv9](https://huggingface.co/chcaa/alvenir-wav2vec2-base-da-nst-cv9) | 0.233 | 0.126 | 0.256 |
	| chcaa/xls-r-300m-nst-cv9-da | 0.105 | 0.060 | 0.119 |
	| [chcaa/xls-r-300m-danish-nst-cv9](https://huggingface.co/chcaa/xls-r-300m-danish-nst-cv9) | 0.082 | 0.051 | 0.108 |
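
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration only; it is not the evaluation script behind the numbers above, it applies no text normalization, and the function name and example sentences are made up. In practice a library such as `jiwer` is a common choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("jeg bor i aarhus", "jeg bor aarhus"))  # 0.25
```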

	The model was fine-tuned in collaboration with [Alvenir](https://alvenir.ai).