saattrupdan committed 4fe448b (parent: 7cf1a61): Update README.md

README.md CHANGED

@@ -3,50 +3,203 @@ library_name: transformers
- base_model:
- - name: roest-315m
-   results:
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You should probably proofread and complete it, then remove this comment. -->
- This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on an unknown dataset.
- ## Training and evaluation data
- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 256
- - eval_batch_size: 256
- - seed: 4242
- - optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 1000
- - training_steps: 10000
---
library_name: transformers
language:
- da
license: openrail
base_model: chcaa/xls-r-300m-danish
datasets:
- alexandrainst/coral
metrics:
- wer
- cer
model-index:
- name: roest-315m
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CoRal read-aloud
      type: alexandrainst/coral
      split: test
      args: read_aloud
    metrics:
    - name: CER
      type: cer
      value: 6.6% ± 0.2%
    - name: WER
      type: wer
      value: 17.0% ± 0.4%
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Danish Common Voice 17
      type: mozilla-foundation/common_voice_17_0
      split: test
      args: da
    metrics:
    - name: CER
      type: cer
      value: 6.6% ± 0.6%
    - name: WER
      type: wer
      value: 16.7% ± 0.8%
pipeline_tag: automatic-speech-recognition
---

# Røst-315m

This is a state-of-the-art Danish speech recognition model, trained by [the Alexandra Institute](https://alexandra.dk/).

## Quick Start

Start by installing the required libraries:

```shell
$ pip install transformers kenlm pyctcdecode
```

Next, you can use the model via the `transformers` Python package as follows:

```python
>>> from transformers import pipeline
>>> audio = get_audio()  # placeholder: your raw 16 kHz mono audio array
>>> transcriber = pipeline(model="alexandrainst/roest-315m")
>>> transcriber(audio)
{'text': 'your transcription'}
```
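
If your audio is stored in a file, one way to get such an array is to load and resample it with `librosa`. This is just a sketch; the file name is a placeholder:

```python
>>> import librosa
>>> # Load any audio file as mono and resample to the 16 kHz that the model expects
>>> audio, _ = librosa.load("example.wav", sr=16_000, mono=True)
>>> transcriber(audio)
{'text': 'your transcription'}
```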

## Evaluation Results

We have evaluated both our and existing models on the CoRal test set as well as the Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have bootstrapped the results 1000 times and report here the mean scores along with a 95% confidence interval (lower is better; best scores in **bold**, second-best in *italics*):

| Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
|:---|---:|---:|---:|---:|---:|
| Røst-315m (this model) | 315M | **6.6% ± 0.2%** | **17.0% ± 0.4%** | 6.6% ± 0.6% | 16.7% ± 0.8% |
| [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | 14.4% ± 0.3% | 36.5% ± 0.6% | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
| [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 14.2% ± 0.5% | 33.2% ± 0.7% | *5.2% ± 0.4%* | *14.2% ± 0.8%* |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | *11.4% ± 0.3%* | *28.3% ± 0.6%* | *5.5% ± 0.4%* | *14.8% ± 0.8%* |
| [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 13.9% ± 0.9% | 32.6% ± 1.2% | 7.2% ± 0.5% | 18.5% ± 0.9% |
| [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 14.5% ± 0.3% | 35.4% ± 0.6% | 9.2% ± 0.5% | 22.9% ± 1.0% |
| [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 17.2% ± 1.3% | 40.5% ± 2.1% | 9.4% ± 0.5% | 24.0% ± 1.0% |
| [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 23.4% ± 1.2% | 55.2% ± 2.3% | 15.9% ± 1.0% | 38.9% ± 1.2% |
| [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 43.5% ± 3.1% | 89.3% ± 4.6% | 33.4% ± 4.7% | 71.4% ± 7.0% |
| [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 52.0% ± 2.5% | 103.7% ± 3.5% | 42.2% ± 3.9% | 83.6% ± 2.7% |
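
As a rough sketch of the bootstrap procedure described above (not the project's actual evaluation code; `jiwer` supplies a WER implementation, and the helper and resampling details are our assumptions):

```python
import numpy as np
from jiwer import wer  # pip install jiwer

def bootstrap_wer(references, hypotheses, n_boot=1000, seed=0):
    """Mean WER with a 95% confidence interval, bootstrapped over test utterances."""
    rng = np.random.default_rng(seed)
    n = len(references)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        scores.append(wer([references[i] for i in idx], [hypotheses[i] for i in idx]))
    scores = np.asarray(scores)
    return scores.mean(), np.percentile(scores, [2.5, 97.5])
```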

### Detailed Evaluation Across Demographics on the CoRal Test Set

![CER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-comparison-cer-plot.png)
![WER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-comparison-wer-plot.png)
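
The plots above group scores by speaker demographics; a breakdown along those lines can be sketched with `pandas` (the file and column names here are hypothetical):

```python
import pandas as pd
from jiwer import wer

# One row per test utterance: reference text, model hypothesis, and speaker metadata
df = pd.read_csv("coral_predictions.csv")  # hypothetical export of model predictions
for dialect, rows in df.groupby("dialect"):
    score = wer(rows["reference"].tolist(), rows["hypothesis"].tolist())
    print(f"{dialect}: {score:.1%}")
```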

## Training Data

This model is the result of three different stages of training:

1. "Pretraining" on 436,000 hours of unlabelled multilingual publicly available data, 13,628 hours of which are Danish. Pretraining here means that the model learnt to "fill in" gaps of raw audio; no transcriptions were used (or available) during this process. The pretraining data is distributed as follows:
   - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being speeches from the European Parliament in 23 European languages. This includes 13,600 hours of Danish speech.
   - 51,000 hours from [Multilingual LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in 8 European languages. This does not include any Danish speech.
   - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670), being read-aloud speech in 60 diverse languages. This does not include any Danish speech.
   - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459), being audio from YouTube videos in 107 languages. This includes 28 hours of Danish speech.
   - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being conversational telephone speech in 17 African and Asian languages. This does not include any Danish speech.
2. "Finetuning" on 373 hours of labelled Danish publicly available data. "Finetuning" indicates that this stage of training was supervised, i.e. the model was trained on both audio and transcriptions to perform the speech-to-text task (also known as automatic speech recognition). The finetuning data is as follows:
   - The read-aloud training split of the [CoRal dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish read-aloud speech, diverse across dialects, accents, ages and genders.
   - The Danish training split of the [Common Voice 17 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), consisting of 12 hours of Danish read-aloud speech.
3. An n-gram language model has been trained separately, and is used to guide the transcription generation of the finetuned speech recognition model (a sketch of how such a decoder is built follows this list). This n-gram language model has been trained on the following datasets:
   - [Danish Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da) (approximately 287,000 articles).
   - [Danish Common Voice 17 training split](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da) (approximately 3,500 sentences).
   - [Danish Reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit/viewer/da) (approximately 5 million comments).

   Note that all samples from the CoRal test dataset have been removed from all of these datasets, to ensure that the n-gram model has not seen the test data.
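
As promised above, here is a sketch of how such an n-gram decoder is typically built with `pyctcdecode` and a KenLM model. The corpus, file names and n-gram order are hypothetical, not the project's actual setup:

```python
# First train a KenLM model on the plain-text corpus, e.g. with:
#   lmplz -o 5 < danish_corpus.txt > danish_5gram.arpa
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("alexandrainst/roest-315m")
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# Attach the n-gram model so that beam search favours plausible Danish word sequences
decoder = build_ctcdecoder(labels, kenlm_model_path="danish_5gram.arpa")
```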

The first stage was carried out by [Babu et al. (2021)](https://doi.org/10.48550/arXiv.2111.09296) and the second and third stages by [Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).

The final product combines the finetuned model with the n-gram model; this combination is what runs when you use the model as shown in the Quick Start section above.
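
If you need lower-level access than the `pipeline` API, the two pieces can presumably be loaded separately. This assumes the repository ships a `Wav2Vec2ProcessorWithLM`, which the `kenlm`/`pyctcdecode` requirement in the Quick Start suggests:

```python
import torch
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

model = AutoModelForCTC.from_pretrained("alexandrainst/roest-315m")
processor = Wav2Vec2ProcessorWithLM.from_pretrained("alexandrainst/roest-315m")

audio = get_audio()  # placeholder: your raw 16 kHz mono audio array, as in the Quick Start
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0]
# decode runs beam search guided by the bundled n-gram model
transcription = processor.decode(logits.numpy()).text
```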

## Intended Use Cases

This model is intended to be used for Danish automatic speech recognition.

Note that biometric identification is not allowed using the CoRal dataset and/or derived models. For more information, see addition 4 in our [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).

## Why the name Røst?

Røst is both the [Danish word for the human voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) and the name of [one of the cold-water coral reefs in Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).

## License

The model is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (no speech synthesis and no biometric identification). See the [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE) for details.

## Creators and Funders

The CoRal project is funded by the [Danish Innovation Fund](https://innovationsfonden.dk/) and consists of the following partners:

- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)

## Citation

We will submit a research paper soon, but until then, if you use this model in your research or development, please cite it as follows:

```bibtex
@dataset{coral2024,
  author = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
  title = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
  year = {2024},
  url = {https://hf.co/datasets/alexandrainst/coral},
}
```