metadata

language:
  - en
datasets:
  - mozilla-foundation/common_voice_13_0
  - facebook/voxpopuli
  - LIUM/tedlium
  - librispeech_asr
  - fisher_corpus
  - WSJ-0
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: tbd
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: LibriSpeech (clean)
          type: librispeech_asr
          config: clean
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 2.5
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: LibriSpeech (other)
          type: librispeech_asr
          config: other
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 5.6
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: tedlium-v3
          type: LIUM/tedlium
          config: release1
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 6.3
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Vox Populi
          type: facebook/voxpopuli
          config: en
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 7.3
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Mozilla Common Voice 13.0
          type: mozilla-foundation/common_voice_13_0
          config: en
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 12.1
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: FLEURS
          type: google/fleurs
          split: test
          args:
            language: en_us
        metrics:
          - type: wer
            value: 6.8
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Switchboard
          type: unk
          split: eval2000
          args:
            language: en
        metrics:
          - type: wer
            value: 6.8
            name: Test WER
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Wall Street Journal
          type: unk
          split: eval92
          args:
            language: en
        metrics:
          - type: wer
            value: 1.3
            name: Test WER

DeCRED-base

This is a 174M-parameter encoder-decoder E-Branchformer model trained with a decoder-centric regularization technique on 6,000 hours of open-source, normalised English data. It achieves word error rates (WERs) comparable to openai/whisper-medium across multiple datasets with roughly a quarter of the parameters.
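For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, dependency-free illustration of that definition; the function name `word_error_rate` is hypothetical and this is not the evaluation script used to produce the reported numbers. It assumes both strings have already been normalised the same way (for example lowercased, with punctuation removed).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion of a reference word
                    current[j - 1] + 1,      # insertion of a hypothesis word
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The function returns a fraction; WER is conventionally reported as a percentage, so multiply by 100 before comparing with published figures.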

Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

Disclaimer: The model currently hallucinates on segments that contain only silence, because it was not trained on such data. A fix will be added soon.
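Until that fix lands, one possible stopgap is to skip segments that are almost certainly silent before transcribing them. The sketch below is an assumption-laden illustration, not part of the model or its repository: the helper names `rms_dbfs` and `is_probably_silent` and the -50 dBFS threshold are hypothetical, and the samples are assumed to be floats scaled to the range [-1.0, 1.0].

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """Return the RMS level in dB relative to full scale (samples in [-1.0, 1.0])."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10.0 * math.log10(mean_square)


def is_probably_silent(samples: Sequence[float], threshold_dbfs: float = -50.0) -> bool:
    """Treat a segment as silent when its RMS level is below the threshold.

    The default threshold is an illustrative guess and should be tuned
    on real recordings.
    """
    return rms_dbfs(samples) < threshold_dbfs
```

A segment flagged by `is_probably_silent` could simply be given an empty transcript instead of being passed to the model. A fixed energy threshold is crude and will misjudge noisy recordings, so a proper voice activity detector is likely a better choice where one is available.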

The model can be used with the pipeline class to transcribe audio files of arbitrary length.

from transformers import pipeline

model_id = "BUT-FIT/DeCRED-base"
pipe = pipeline("automatic-speech-recognition", model=model_id, feature_extractor=model_id, trust_remote_code=True)
# In transformers versions newer than 4.31.0, there is a bug in how the pipeline infers its type.
# Setting the type explicitly works around it; the resulting warning can be ignored.
pipe.type = "seq2seq"

# Run beam search decoding with joint CTC-attention scorer
result_beam = pipe("audio.wav")

# Run greedy decoding without joint CTC-attention scorer
pipe.model.generation_config.ctc_weight = 0.0
pipe.model.generation_config.num_beams = 1

result_greedy = pipe("audio.wav")