language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: mit
pipeline_tag: automatic-speech-recognition
widget:
- example_title: Librispeech sample 1
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: s2t-small-librispeech-asr
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (clean)
type: librispeech_asr
config: clean
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 4.3
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: LibriSpeech (other)
type: librispeech_asr
config: other
split: test
args:
language: en
metrics:
- name: Test WER
type: wer
value: 9
library_name: transformers
SWRA (SWARA)
SWRA (SWARA)
is a Speech to Text Transformer (S2T) model trained by @binarybardakshat for automatic speech recognition (ASR).
Model Description
SWRA (SWARA) is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.
How to Use
As this is a standard sequence-to-sequence transformer model, you can use the generate
method to generate the transcripts by passing the speech features to the model.
Note: The Speech2TextProcessor
object uses torchaudio to extract the filter bank features. Make sure to install the torchaudio
package before running this example.
Note: The feature extractor depends on torchaudio and the tokenizer depends on sentencepiece, so be sure to install those packages before running the examples.
You could either install those as extra speech dependencies with pip install transformers"[speech, sentencepiece]"
or install the packages separately with pip install torchaudio sentencepiece
.
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset
model = Speech2TextForConditionalGeneration.from_pretrained("binarybardakshat/swra-swara")
processor = Speech2TextProcessor.from_pretrained("binarybardakshat/swra-swara")
ds = load_dataset(
"patrickvonplaten/librispeech_asr_dummy",
"clean",
split="validation"
)
input_features = processor(
ds[0]["audio"]["array"],
sampling_rate=16_000,
return_tensors="pt"
).input_features # Batch size 1
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids)
#### Evaluation on LibriSpeech Test
The following script shows how to evaluate this model on the [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)
*"clean"* and *"other"* test dataset.
```python
from datasets import load_dataset
from evaluate import load
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test") # change to "other" for other test dataset
wer = load("wer")
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)
def map_to_pred(batch):
features = processor(batch["audio"]["array"], sampling_rate=16000, padding=True, return_tensors="pt")
input_features = features.input_features.to("cuda")
attention_mask = features.attention_mask.to("cuda")
gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
return batch
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
Result (WER):
"clean" | "other" |
---|---|
4.3 | 9.0 |
Training data
The S2T-SMALL-LIBRISPEECH-ASR is trained on LibriSpeech ASR Corpus, a dataset consisting of approximately 1000 hours of 16kHz read English speech.