saattrupdan committed
Commit 4fe448b
1 Parent(s): 7cf1a61

Update README.md

Files changed (1)
  1. README.md +184 -31
README.md CHANGED
@@ -3,50 +3,203 @@ library_name: transformers
  language:
  - da
  license: openrail
- base_model: facebook/wav2vec2-xls-r-300m
- tags:
- - generated_from_trainer
+ base_model: chcaa/xls-r-300m-danish
+ datasets:
+ - alexandrainst/coral
+ metrics:
+ - wer
+ - cer
  model-index:
- - name: roest-315m-xlsr
-   results: []
+ - name: roest-315m
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: CoRal read-aloud
+       type: alexandrainst/coral
+       split: test
+       args: read_aloud
+     metrics:
+     - name: CER
+       type: cer
+       value: 6.6% ± 0.2%
+     - name: WER
+       type: wer
+       value: 17.0% ± 0.4%
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Danish Common Voice 17
+       type: mozilla-foundation/common_voice_17_0
+       split: test
+       args: da
+     metrics:
+     - name: CER
+       type: cer
+       value: 6.6% ± 0.6%
+     - name: WER
+       type: wer
+       value: 16.7% ± 0.8%
+ pipeline_tag: automatic-speech-recognition
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
 
- # roest-315m-xlsr
 
- This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on an unknown dataset.
 
- ## Model description
 
- More information needed
 
- ## Intended uses & limitations
 
- More information needed
 
- ## Training and evaluation data
 
- More information needed
 
- ## Training procedure
 
- ### Training hyperparameters
 
- The following hyperparameters were used during training:
- - learning_rate: 0.0001
- - train_batch_size: 256
- - eval_batch_size: 256
- - seed: 4242
- - optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 1000
- - training_steps: 10000
 
- ### Framework versions
 
- - Transformers 4.44.2
- - Pytorch 2.4.1+cu121
- - Datasets 3.0.0
- - Tokenizers 0.19.1
 
+ # Røst-315m
+
+ This is a Danish state-of-the-art speech recognition model, trained by [the Alexandra
+ Institute](https://alexandra.dk/).
+
+
+ ## Quick Start
+ Start by installing the required libraries:
+
+ ```shell
+ $ pip install transformers kenlm pyctcdecode
+ ```
+
+ Next, you can use the model with the `transformers` Python package as follows:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> audio = get_audio()  # 16 kHz mono raw audio array
+ >>> transcriber = pipeline(model="alexandrainst/roest-315m")
+ >>> transcriber(audio)
+ {'text': 'your transcription'}
+ ```
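+
+ If you do not already have a way of loading audio, here is a minimal sketch of a
+ `get_audio` replacement, assuming a local 16 kHz mono WAV file and the `soundfile`
+ package (an assumption; any loader that yields a 16 kHz mono float array works):
+
+ ```python
+ import soundfile as sf
+ from transformers import pipeline
+
+ # Load a local audio file; the model expects 16 kHz mono input, so
+ # resample first if your file uses a different sampling rate.
+ audio, sample_rate = sf.read("example.wav")
+ assert sample_rate == 16_000, "resample to 16 kHz before transcribing"
+
+ transcriber = pipeline(model="alexandrainst/roest-315m")
+ print(transcriber(audio)["text"])
+ ```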
+
+
+ ## Evaluation Results
+
+ We have evaluated both our own and existing models on the CoRal test set as well as the
+ Danish Common Voice 17 test set. To ensure as robust an evaluation as possible, we have
+ bootstrapped the results 1000 times and report here the mean scores along with a 95%
+ confidence interval (lower is better; best scores in **bold**, second-best in
+ *italics*):
+
+ | Model | Number of parameters | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) CER | [Danish Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da/test) WER |
+ |:---|---:|---:|---:|---:|---:|
+ | Røst-315m (this model) | 315M | **6.6% ± 0.2%** | **17.0% ± 0.4%** | 6.6% ± 0.6% | 16.7% ± 0.8% |
+ | [chcaa/xls-r-300m-danish-nst-cv9](https://hf.co/chcaa/xls-r-300m-danish-nst-cv9) | 315M | 14.4% ± 0.3% | 36.5% ± 0.6% | **4.1% ± 0.5%** | **12.0% ± 0.8%** |
+ | [mhenrichsen/hviske](https://hf.co/mhenrichsen/hviske) | 1540M | 14.2% ± 0.5% | 33.2% ± 0.7% | *5.2% ± 0.4%* | *14.2% ± 0.8%* |
+ | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | *11.4% ± 0.3%* | *28.3% ± 0.6%* | 5.5% ± 0.4% | 14.8% ± 0.8% |
+ | [openai/whisper-large-v2](https://hf.co/openai/whisper-large-v2) | 1540M | 13.9% ± 0.9% | 32.6% ± 1.2% | 7.2% ± 0.5% | 18.5% ± 0.9% |
+ | [openai/whisper-large](https://hf.co/openai/whisper-large) | 1540M | 14.5% ± 0.3% | 35.4% ± 0.6% | 9.2% ± 0.5% | 22.9% ± 1.0% |
+ | [openai/whisper-medium](https://hf.co/openai/whisper-medium) | 764M | 17.2% ± 1.3% | 40.5% ± 2.1% | 9.4% ± 0.5% | 24.0% ± 1.0% |
+ | [openai/whisper-small](https://hf.co/openai/whisper-small) | 242M | 23.4% ± 1.2% | 55.2% ± 2.3% | 15.9% ± 1.0% | 38.9% ± 1.2% |
+ | [openai/whisper-base](https://hf.co/openai/whisper-base) | 73M | 43.5% ± 3.1% | 89.3% ± 4.6% | 33.4% ± 4.7% | 71.4% ± 7.0% |
+ | [openai/whisper-tiny](https://hf.co/openai/whisper-tiny) | 38M | 52.0% ± 2.5% | 103.7% ± 3.5% | 42.2% ± 3.9% | 83.6% ± 2.7% |
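+
+ For reference, the bootstrap works by resampling the test set with replacement and
+ recomputing the metric on each resample. A minimal sketch of this procedure (not the
+ exact evaluation script; `references` and `predictions` are assumed parallel lists of
+ transcripts, and the `jiwer` package supplies the metrics):
+
+ ```python
+ import random
+
+ import numpy as np
+ from jiwer import cer, wer
+
+ def bootstrap_ci(references, predictions, metric, n_resamples=1000):
+     """Mean score and 95% confidence interval via bootstrapping."""
+     n = len(references)
+     scores = []
+     for _ in range(n_resamples):
+         # Resample the test set with replacement
+         idx = [random.randrange(n) for _ in range(n)]
+         scores.append(metric([references[i] for i in idx],
+                              [predictions[i] for i in idx]))
+     lower, upper = np.percentile(scores, [2.5, 97.5])
+     return float(np.mean(scores)), lower, upper
+
+ # bootstrap_ci(references, predictions, wer) gives the WER estimate,
+ # and bootstrap_ci(references, predictions, cer) the CER estimate.
+ ```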
+
+
+ ### Detailed Evaluation Across Demographics on the CoRal Test Set
+
+ ![CER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-comparison-cer-plot.png)
+ ![WER comparison plot](https://filedn.com/lRBwPhPxgV74tO0rDoe8SpH/coral/roest-comparison-wer-plot.png)
+
+
+ ## Training Data
+
+ This model is the result of three stages of training:
+
+ 1. "Pretraining" on 436,000 hours of unlabelled, publicly available multilingual data,
+    13,628 hours of which are Danish. Pretraining here means that the model learnt to
+    "fill in" gaps of raw audio - no transcriptions were used (or available) during
+    this process. The pretraining data is distributed as follows:
+    - 372,000 hours from [VoxPopuli](https://aclanthology.org/2021.acl-long.80/), being
+      speeches from the European Parliament in 23 European languages.
+      This includes 13,600 hours of Danish speech.
+    - 51,000 hours from [Multilingual
+      LibriSpeech](https://doi.org/10.21437/Interspeech.2020-2826), being audiobooks in
+      8 European languages. This does not include any Danish speech.
+    - 7,000 hours from [Common Voice 6](https://doi.org/10.48550/arXiv.1912.06670),
+      being read-aloud speech in 60 diverse languages. This does not include any Danish
+      speech.
+    - 6,600 hours from [VoxLingua107](https://doi.org/10.1109/SLT48900.2021.9383459),
+      being audio from YouTube videos in 107 languages. This includes 28 hours of
+      Danish speech.
+    - 1,000 hours from [BABEL](https://eprints.whiterose.ac.uk/152840/), being
+      conversational telephone speech in 17 African and Asian languages. This does not
+      include any Danish speech.
+ 2. "Finetuning" on 373 hours of labelled, publicly available Danish data. "Finetuning"
+    indicates that this stage of training was supervised, i.e. the model was trained on
+    both audio and transcriptions to perform the speech-to-text task (also known as
+    automatic speech recognition). The finetuning data is as follows:
+    - The read-aloud training split of the [CoRal
+      dataset](https://huggingface.co/datasets/alexandrainst/coral) (revision
+      fb20199b3966d3373e0d3a5ded2c5920c70de99c), consisting of 361 hours of Danish
+      read-aloud speech, diverse across dialects, accents, ages and genders.
+    - The Danish training split of the [Common Voice 17
+      dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0),
+      consisting of 12 hours of Danish read-aloud speech.
+ 3. An n-gram language model has been trained separately, and is used to guide the
+    transcription generation of the finetuned speech recognition model (see the sketch
+    after this list for how such a model can be built and attached to the decoder).
+    This n-gram language model has been trained on the following datasets:
+    - [Danish
+      Wikipedia](https://huggingface.co/datasets/alexandrainst/scandi-wiki/viewer/da)
+      (approximately 287,000 articles).
+    - [Danish Common Voice 17 training
+      split](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/da)
+      (approximately 3,500 samples).
+    - [Danish
+      Reddit](https://huggingface.co/datasets/alexandrainst/scandi-reddit/viewer/da)
+      (approximately 5 million comments).
+
+    Note that all samples from the CoRal test dataset have been removed from all of
+    these datasets, to ensure that the n-gram model has not seen the test data.
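+
+ As a rough illustration of stage 3, such a language model can be built with KenLM and
+ wired into CTC decoding via the `pyctcdecode` package from the Quick Start. This is a
+ sketch under assumptions, not the exact training setup: `corpus.txt` stands for the
+ concatenated text data above, KenLM's `lmplz` binary is assumed to be on the PATH, and
+ the n-gram order of 5 is arbitrary here:
+
+ ```python
+ import subprocess
+
+ from pyctcdecode import build_ctcdecoder
+ from transformers import Wav2Vec2CTCTokenizer
+
+ # Train an n-gram model on the text corpus with KenLM's `lmplz` tool
+ with open("corpus.txt") as f_in, open("5gram.arpa", "w") as f_out:
+     subprocess.run(["lmplz", "-o", "5"], stdin=f_in, stdout=f_out, check=True)
+
+ # The decoder needs the acoustic model's character vocabulary, in
+ # vocabulary-index order
+ tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("alexandrainst/roest-315m")
+ vocab = tokenizer.get_vocab()
+ labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
+
+ # Wrap the n-gram model in a beam-search CTC decoder
+ decoder = build_ctcdecoder(labels, kenlm_model_path="5gram.arpa")
+ ```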
+
+ The first stage was trained by [Babu et al.
+ (2021)](https://doi.org/10.48550/arXiv.2111.09296) and the second and third stages by
+ [Nielsen et al. (2024)](https://huggingface.co/alexandrainst/roest-315m).
+
+ The final product is then the combination of the finetuned model and the n-gram model,
+ and this combination is what runs when you use the model as described in the Quick
+ Start section above.
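+
+ The `pipeline` helper hides these two components. A minimal sketch of the equivalent
+ manual steps, assuming the repository ships a `Wav2Vec2ProcessorWithLM` (which the
+ kenlm/pyctcdecode requirement in the Quick Start suggests) and that `audio` is a
+ 16 kHz mono waveform as before:
+
+ ```python
+ import torch
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
+
+ # The processor bundles the feature extractor, the tokenizer and the
+ # pyctcdecode/KenLM beam-search decoder
+ processor = Wav2Vec2ProcessorWithLM.from_pretrained("alexandrainst/roest-315m")
+ model = Wav2Vec2ForCTC.from_pretrained("alexandrainst/roest-315m")
+
+ inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ # Beam-search decoding guided by the n-gram language model
+ transcription = processor.batch_decode(logits.numpy()).text[0]
+ print(transcription)
+ ```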
+
+
+ ## Intended use cases
+
+ This model is intended to be used for Danish automatic speech recognition.
+
+ Note that Biometric Identification is not allowed using the CoRal dataset and/or
+ derived models. For more information, see addition 4 in our
+ [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE).
+
+
+ ## Why the name Røst?
+
+ Røst is both the [Danish word for the human
+ voice](https://ordnet.dk/ddo/ordbog?query=r%C3%B8st) and the name of [one of the
+ cold-water coral reefs in
+ Scandinavia](https://da.wikipedia.org/wiki/Koralrev#Koldtvandskoralrev).
+
+
+ ## License
+ The model is licensed under a custom license, adapted from OpenRAIL-M, which allows
+ commercial use with a few restrictions (no speech synthesis and no biometric
+ identification). See the
+ [license](https://huggingface.co/datasets/alexandrainst/roest-315m/blob/main/LICENSE)
+ for details.
+
+
+ ## Creators and Funders
+ The CoRal project is funded by the [Danish Innovation
+ Fund](https://innovationsfonden.dk/) and consists of the following partners:
+
+ - [Alexandra Institute](https://alexandra.dk/)
+ - [University of Copenhagen](https://www.ku.dk/)
+ - [Agency for Digital Government](https://digst.dk/)
+ - [Alvenir](https://www.alvenir.ai/)
+ - [Corti](https://www.corti.ai/)
+
+
+ ## Citation
+
+ We will submit a research paper soon, but until then, if you use this model in your
+ research or development, please cite it as follows:
+
+ ```bibtex
+ @dataset{coral2024,
+   author = {Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Anders Jess Pedersen and Anna Katrine van Zee and Anders Søgaard and Torben Blach},
+   title = {CoRal: A Diverse Danish ASR Dataset Covering Dialects, Accents, Genders, and Age Groups},
+   year = {2024},
+   url = {https://hf.co/datasets/alexandrainst/coral},
+ }
+ ```