---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 3.5
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 8.1
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 5.4
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 8.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 16.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      split: test
      args:
        language: en_us
    metrics:
    - type: wer
      value: 9.6
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Switchboard
      type: unk
      split: eval2000
      args:
        language: en
    metrics:
    - type: wer
      value: 9.2
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Wall Street Journal
      type: unk
      split: eval92
      args:
        language: en
    metrics:
    - type: wer
      value: 2.6
      name: Test WER
---
# DeCRED-base
This is a **40M-parameter encoder-decoder Ebranchformer model** trained with a decoder-centric regularization technique on 6,000 hours of open-source, normalised English data.

Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

*Disclaimer: The model currently hallucinates on segments that contain only silence, because it was not trained on such data. A fix will be added soon.*
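
Until that fix lands, one possible workaround is to skip segments that are effectively silent before passing them to the model. The snippet below is a minimal sketch of such a check, not part of this repository: the function name `is_mostly_silent`, the RMS-based criterion, and the -50 dBFS threshold are all illustrative assumptions and would need tuning for real recordings.

```python
import numpy as np


def is_mostly_silent(samples: np.ndarray, threshold_db: float = -50.0) -> bool:
    """Return True if the RMS level of `samples` is below `threshold_db` dBFS.

    Hypothetical helper: `samples` is assumed to be mono float audio scaled
    to [-1.0, 1.0]. The -50 dBFS default is an untuned, illustrative value.
    """
    if samples.size == 0:
        return True
    # Root-mean-square energy of the segment, computed in float64.
    rms = np.sqrt(np.mean(np.square(samples, dtype=np.float64)))
    # Clamp to avoid log10(0) on digital silence.
    level_db = 20.0 * np.log10(max(float(rms), 1e-10))
    return bool(level_db < threshold_db)


# Example: one second of digital silence vs. a quiet 440 Hz tone at 16 kHz.
silence = np.zeros(16000, dtype=np.float32)
t = np.arange(16000) / 16000.0
tone = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

silence_flag = is_mostly_silent(silence)
tone_flag = is_mostly_silent(tone)
```

A caller could transcribe a segment only when this check returns `False`. A simple energy threshold will not catch low-level noise or music, so it is a stopgap rather than a substitute for proper voice activity detection.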

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe audio files of arbitrary length.

```python
from transformers import pipeline

model_id = "BUT-FIT/DeCRED-small"
pipe = pipeline("automatic-speech-recognition", model=model_id, feature_extractor=model_id, trust_remote_code=True)
# Newer versions of transformers (>4.31.0) have a bug in how the pipeline
# inference type is determined, so set it explicitly. The resulting warning can be ignored.
pipe.type = "seq2seq"

# Run beam search decoding with joint CTC-attention scorer
result_beam = pipe("audio.wav")

# Run greedy decoding without joint CTC-attention scorer
pipe.model.generation_config.ctc_weight = 0.0
pipe.model.generation_config.num_beams = 1

result_greedy = pipe("audio.wav")
```
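
The results listed in the metadata of this card are word error rates (WER). For readers who want to score their own transcripts, here is a minimal sketch of the metric as a word-level edit distance divided by the reference length. The function name `word_error_rate` is hypothetical. The sketch does not reproduce the text normalisation used for the reported numbers, so its output is not directly comparable to them unless the same normalisation is applied first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Hypothetical helper for illustration. Both strings are split on whitespace
    and no casing or punctuation normalisation is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The function returns a fraction; multiply by 100 to express it as a percentage, as in the metadata above.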