---
license: apache-2.0
language:
  - ru
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: waveletdeboshir/whisper-base-ru-pruned
tags:
  - asr
  - Pytorch
  - pruned
  - finetune
  - audio
  - automatic-speech-recognition
model-index:
  - name: Whisper Base Pruned and Finetuned for Russian
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 15.0 (Russian part, test)
          type: mozilla-foundation/common_voice_15_0
          args: ru
        metrics:
          - name: WER
            type: wer
            value: 26.52
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 15.0 (Russian part, test)
          type: mozilla-foundation/common_voice_15_0
          args: ru
        metrics:
          - name: WER (without punctuation)
            type: wer
            value: 21.35
datasets:
  - mozilla-foundation/common_voice_15_0
---

# Whisper-base-ru-pruned-ft

## Model info

This is a fine-tuned version of the pruned whisper-base model [waveletdeboshir/whisper-base-ru-pruned](https://huggingface.co/waveletdeboshir/whisper-base-ru-pruned) for the Russian language.

The model was fine-tuned on the Russian part of mozilla-foundation/common_voice_15_0 with SpecAugment, colored-noise augmentation, and noise-from-file augmentation.
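The exact training pipeline is not published in this card, so the snippet below is only a rough sketch of how similar augmentations could be set up with `transformers` and `torch`. The SpecAugment probabilities, the SNR value, and the `colored_noise`/`add_noise` helpers are illustrative assumptions, not the settings used for this checkpoint.

```python
# Rough sketch only: illustrative augmentation setup, not the original training code.
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "waveletdeboshir/whisper-base-ru-pruned"
)

# SpecAugment masking of the log-mel features during training
# (probability values here are made up for illustration).
model.config.apply_spec_augment = True
model.config.mask_time_prob = 0.05
model.config.mask_feature_prob = 0.05


def colored_noise(n_samples: int, exponent: float = 1.0) -> torch.Tensor:
    """Generate 1/f**exponent noise by shaping white noise in the frequency domain."""
    freqs = torch.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]  # avoid division by zero at DC
    spectrum = torch.randn(freqs.shape[0], dtype=torch.complex64) / freqs ** (exponent / 2)
    noise = torch.fft.irfft(spectrum, n=n_samples)
    return noise / noise.abs().max()


def add_noise(wav: torch.Tensor, noise: torch.Tensor, snr_db: float = 15.0) -> torch.Tensor:
    """Mix a noise waveform (generated or loaded from a file) into wav at a given SNR.

    Assumes noise is at least as long as wav.
    """
    noise = noise[: wav.shape[-1]]
    signal_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise
```

In a setup like this, the noise would typically be mixed into the raw waveforms before feature extraction, while SpecAugment masking happens inside the model on the log-mel features.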

## Metrics

| metric | dataset | waveletdeboshir/whisper-base-ru-pruned | waveletdeboshir/whisper-base-ru-pruned-ft |
| :------ | :------ | :------ | :------ |
| WER (without punctuation) | common_voice_15_0_test | 0.3352 | 0.2135 |
| WER | common_voice_15_0_test | 0.4050 | 0.2652 |
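For reference, here is a minimal sketch of how WER numbers like these can be computed with the `evaluate` library; the punctuation-stripping normalization used for the "without punctuation" row is an assumption, and the example transcripts are placeholders.

```python
# Minimal WER computation sketch; the text normalization here is an assumption.
import re
import evaluate

wer_metric = evaluate.load("wer")


def strip_punctuation(text: str) -> str:
    """Lowercase and drop everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()


references = ["Начинаем работу."]   # ground-truth transcripts from the test set
predictions = ["начинаем работу"]   # model outputs

wer = wer_metric.compute(predictions=predictions, references=references)
wer_no_punct = wer_metric.compute(
    predictions=[strip_punctuation(p) for p in predictions],
    references=[strip_punctuation(r) for r in references],
)
print(f"WER: {wer:.4f}, WER (without punctuation): {wer_no_punct:.4f}")
```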

## Limitations

Because texts in Common Voice contain no digits or characters other than letters and punctuation marks, the model has lost the ability to predict numbers and special characters.

## Size

Only 10% of the tokens were kept: the special Whisper tokens (no language tokens except `<|ru|>` and `<|en|>`, and no timestamp tokens), the 200 most frequent tokens from the original tokenizer, and the 4000 most frequent Russian tokens computed by tokenizing a Russian text corpus.
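A hypothetical sketch of how such token frequencies could be computed is shown below; the corpus file name and the exact counting procedure are assumptions, not the original pruning script.

```python
# Hypothetical sketch: count token frequencies over a Russian text corpus.
from collections import Counter
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")
counter = Counter()

with open("russian_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for line in f:
        counter.update(tokenizer.encode(line, add_special_tokens=False))

# keep the 4000 most frequent token ids seen in the corpus
top_russian_ids = [token_id for token_id, _ in counter.most_common(4000)]
```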

The model is about 30% smaller than the original whisper-base:

|  | openai/whisper-base | waveletdeboshir/whisper-base-ru-pruned-ft |
| :------ | :------ | :------ |
| n of parameters | 74 M | 48 M |
| n of parameters (with proj_out layer) | 99 M | 50 M |
| model file size | 290 MB | 193 MB |
| vocab_size | 51865 | 4207 |
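The parameter counts can be checked roughly as follows; the exact numbers may differ slightly from the table depending on rounding and on how the tied output projection is counted.

```python
# Quick check of the parameter counts above (approximate).
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "waveletdeboshir/whisper-base-ru-pruned-ft"
)
n = sum(p.numel() for p in model.model.parameters())  # encoder + decoder
n_with_head = n + model.proj_out.weight.numel()       # add the output projection
print(f"{n / 1e6:.0f}M parameters, {n_with_head / 1e6:.0f}M with proj_out")
```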

## Usage

The model can be used in the same way as the original Whisper:

```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> import torchaudio

>>> # load audio
>>> wav, sr = torchaudio.load("audio.wav")
>>> # Whisper expects 16 kHz mono audio; resample with torchaudio.functional.resample if needed

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("waveletdeboshir/whisper-base-ru-pruned-ft")
>>> model = WhisperForConditionalGeneration.from_pretrained("waveletdeboshir/whisper-base-ru-pruned-ft")

>>> input_features = processor(wav[0], sampling_rate=sr, return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Начинаем работу.<|endoftext|>']
```

The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.
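For example, continuing the snippet above (output shown for illustration):

```python
>>> processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Начинаем работу.']
```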

## Other pruned whisper models