saattrupdan committed 4fe448b (parent: 7cf1a61): Update README.md

README.md CHANGED

@@ -3,50 +3,203 @@ library_name: transformers
- base_model:
- - name: roest-315m
-   results:
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You should probably proofread and complete it, then remove this comment. -->
- This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on an unknown dataset.
- ## Training and evaluation data
- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 256
- - eval_batch_size: 256
- - seed: 4242
- - optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 1000
- - training_steps: 10000
---
library_name: transformers
language:
- da
license: openrail
base_model: chcaa/xls-r-300m-danish
datasets:
- alexandrainst/coral
metrics:
- wer
- cer
model-index:
- name: roest-315m
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CoRal read-aloud
      type: alexandrainst/coral
      split: test
      args: read_aloud
    metrics:
    - name: CER
      type: cer
      value: 6.6% ± 0.2%
    - name: WER
      type: wer
      value: 17.0% ± 0.4%
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Danish Common Voice 17
      type: mozilla-foundation/common_voice_17_0
      split: test
      args: da
    metrics:
    - name: CER
      type: cer
      value: 6.6% ± 0.6%
    - name: WER
      type: wer
      value: 16.7% ± 0.8%
pipeline_tag: automatic-speech-recognition
---

# Røst-315m

This is a state-of-the-art Danish speech recognition model, trained by [the Alexandra Institute](https://alexandra.dk/).

## Quick Start

Start by installing the required libraries:

```shell
$ pip install transformers kenlm pyctcdecode
```

Next, you can use the model via the `transformers` Python package as follows:

```python
>>> from transformers import pipeline
>>> audio = get_audio()  # placeholder: your raw 16 kHz mono audio array
>>> transcriber = pipeline(model="alexandrainst/roest-315m")
>>> transcriber(audio)
{'text': 'your transcription'}
```
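
If your audio is stored in a file, one way to get such an array is to load and resample it with `librosa`. This is just a sketch; the file name is a placeholder:

```python
>>> import librosa
>>> # Load any audio file as mono and resample to the 16 kHz that the model expects
>>> audio, _ = librosa.load("example.wav", sr=16_000, mono=True)
>>> transcriber(audio)
{'text': 'your transcription'}
```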

## Evaluation Results

We have evaluated both our and existing models on the CoRal test set as well as the Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have bootstrapped the results 1000 times and report here the mean scores along with a 95% confidence interval (lower is better; best scores in **bold**, second-best in *italics*):

| Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
|:---|---:|---:|---:|---:|---:|
| Røst-315m (this model) | 315M | **6.6% ± 0.2%** | **17.0% ± 0.4%** | 6.6% ± 0.6% | 16.7% ± 0.8% |
| [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | 14.4% ± 0.3% | 36.5% ± 0.6% | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
| [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 14.2% ± 0.5% | 33.2% ± 0.7% | *5.2% ± 0.4%* | *14.2% ± 0.8%* |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | *11.4% ± 0.3%* | *28.3% ± 0.6%* | *5.5% ± 0.4%* | *14.8% ± 0.8%* |
| [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 13.9% ± 0.9% | 32.6% ± 1.2% | 7.2% ± 0.5% | 18.5% ± 0.9% |
| [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 14.5% ± 0.3% | 35.4% ± 0.6% | 9.2% ± 0.5% | 22.9% ± 1.0% |
| [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 17.2% ± 1.3% | 40.5% ± 2.1% | 9.4% ± 0.5% | 24.0% ± 1.0% |
| [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 23.4% ± 1.2% | 55.2% ± 2.3% | 15.9% ± 1.0% | 38.9% ± 1.2% |
| [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 43.5% ± 3.1% | 89.3% ± 4.6% | 33.4% ± 4.7% | 71.4% ± 7.0% |
| [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 52.0% ± 2.5% | 103.7% ± 3.5% | 42.2% ± 3.9% | 83.6% ± 2.7% |
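
As a rough sketch of the bootstrap procedure described above (not the project's actual evaluation code; `jiwer` supplies a WER implementation, and the helper and resampling details are our assumptions):

```python
import numpy as np
from jiwer import wer  # pip install jiwer

def bootstrap_wer(references, hypotheses, n_boot=1000, seed=0):
    """Mean WER with a 95% confidence interval, bootstrapped over test utterances."""
    rng = np.random.default_rng(seed)
    n = len(references)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        scores.append(wer([references[i] for i in idx], [hypotheses[i] for i in idx]))
    scores = np.asarray(scores)
    return scores.mean(), np.percentile(scores, [2.5, 97.5])
```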

### Detailed Evaluation Across Demographics on the CoRal Test Set

![CER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-comparison-cer-plot.png)
![WER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-comparison-wer-plot.png)
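
The plots above group scores by speaker demographics; a breakdown along those lines can be sketched with `pandas` (the file and column names here are hypothetical):

```python
import pandas as pd
from jiwer import wer

# One row per test utterance: reference text, model hypothesis, and speaker metadata
df = pd.read_csv("coral_predictions.csv")  # hypothetical export of model predictions
for dialect, rows in df.groupby("dialect"):
    score = wer(rows["reference"].tolist(), rows["hypothesis"].tolist())
    print(f"{dialect}: {score:.1%}")
```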

## Training Data

This model is the result of three different stages of training:

1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data, 13,628 hours of which are Danish. Pretraining here means that the model learnt to "fill in" gaps of raw audio; no transcriptions were used (or available) during this process. The pretraining data is distributed as follows:
   - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being speeches from the European Parliament in 23 European languages. This includes 13,600 hours of Danish speech.
   - 51,000 hours from [Multilingual LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in 8 European languages. This does not include any Danish speech.
   - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670), being read-aloud speech in 60 diverse languages. This does not include any Danish speech.
   - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459), being audio from YouTube videos in 107 languages. This includes 28 hours of Danish speech.
   - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being conversational telephone speech in 17 African and Asian languages. This does not include any Danish speech.
2. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning" indicates that this stage of training was supervised, i.e. the model was trained on both audio and transcriptions to perform the speech-to-text task (also known as automatic speech recognition). The finetuning data is as follows:
   - The read-aloud training split of the [CoRal dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish read-aloud speech, diverse across dialects, accents, ages and genders.
   - The Danish training split of the [Common Voice 17 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), consisting of 12 hours of Danish read-aloud speech.
3. An n-gram language model has been trained separately, and is used to guide the transcription generation of the finetuned speech recognition model (a sketch of how such a decoder is built follows this list). This n-gram language model has been trained on the following datasets:
   - [Danish Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da) (approximately 287,000 articles).
   - [Danish Common Voice 17 training split](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da) (approximately 3,500 sentences).
   - [Danish Reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit/viewer/da) (approximately 5 million comments).

   Note that all samples from the CoRal test dataset have been removed from all of these datasets, to ensure that the n-gram model has not seen the test data.
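
As promised above, here is a sketch of how such an n-gram decoder is typically built with `pyctcdecode` and a KenLM model. The corpus, file names and n-gram order are hypothetical, not the project's actual setup:

```python
# First train a KenLM model on the plain-text corpus, e.g. with:
#   lmplz -o 5 < danish_corpus.txt > danish_5gram.arpa
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("alexandrainst/roest-315m")
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Attach the n-gram model so that beam search favours plausible Danish word sequences
decoder = build_ctcdecoder(labels, kenlm_model_path="danish_5gram.arpa")
```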

The first stage was carried out by [Babu et al. (2021)](https://doi.org/10.48550/arXiv.2111.09296) and the second and third stages by [Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).

The final product combines the finetuned model with the n-gram model; this combination is what runs when you use the model as shown in the Quick Start section above.
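
If you need lower-level access than the `pipeline` API, the two pieces can presumably be loaded separately. This assumes the repository ships a `Wav2Vec2ProcessorWithLM`, which the `kenlm`/`pyctcdecode` requirement in the Quick Start suggests:

```python
import torch
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

model = AutoModelForCTC.from_pretrained("alexandrainst/roest-315m")
processor = Wav2Vec2ProcessorWithLM.from_pretrained("alexandrainst/roest-315m")

audio = get_audio()  # placeholder: your raw 16 kHz mono audio array, as in the Quick Start
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]
# decode runs beam search guided by the bundled n-gram model
transcription = processor.decode(logits.numpy()).text
```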

## Intended Use Cases

This model is intended to be used for Danish automatic speech recognition.

Note that biometric identification is not allowed using the CoRal dataset and/or derived models. For more information, see addition 4 in our [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).

## Why the name Røst?

Røst is both the [Danish word for the human voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) and the name of [one of the cold-water coral reefs in Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).

## License

The model is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (no speech synthesis and no biometric identification). See the [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE) for details.

## Creators and Funders

The CoRal project is funded by the [Danish Innovation Fund](https://innovationsfonden.dk/) and consists of the following partners:

- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)

## Citation

We will submit a research paper soon, but until then, if you use this model in your research or development, please cite it as follows:

```bibtex
@dataset{coral2024,
  author = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
  title = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
  year = {2024},
  url = {https://hf.co/datasets/alexandrainst/coral},
}
```