anton-l HF staff committed on
Commit
59380d9
1 Parent(s): 29ee582

Create README.md

Files changed (1)
  1. README.md +93 -0
README.md ADDED
@@ -0,0 +1,93 @@
+ ---
+ language: en
+ datasets:
+ - librispeech_asr
+ tags:
+ - audio
+ - speech
+ - automatic-speech-recognition
+ license: apache-2.0
+ widget:
+ - label: Librispeech sample 1
+   src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+ - label: Librispeech sample 2
+   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
+ ---
+
+ # SEW-D-mid
+
+ [SEW-D by ASAPP Research](https://github.com/asappresearch/sew)
+
+ This is the base model, pretrained on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz. Note that this model should be fine-tuned on a downstream task, such as automatic speech recognition, speaker identification, intent classification, or emotion recognition.
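The 16 kHz requirement above can be met by resampling the audio before it reaches the processor. The following is a minimal sketch, not part of the original README: the helper names `resample_ratio` and `to_16k` are hypothetical, and it assumes SciPy is installed whenever resampling is actually needed.

```python
from math import gcd
from typing import Tuple

TARGET_SR = 16_000  # sampling rate the model expects, per the note above


def resample_ratio(orig_sr: int, target_sr: int = TARGET_SR) -> Tuple[int, int]:
    """Return the smallest integer (up, down) pair with orig_sr * up / down == target_sr."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    g = gcd(orig_sr, target_sr)
    return target_sr // g, orig_sr // g


def to_16k(waveform, orig_sr: int):
    """Resample a 1-D waveform to 16 kHz; return the input unchanged if it is already 16 kHz."""
    if orig_sr == TARGET_SR:
        return waveform
    # Imported lazily so SciPy is only required when resampling is actually needed.
    from scipy.signal import resample_poly

    up, down = resample_ratio(orig_sr)
    # Polyphase resampling: upsample by `up`, low-pass filter, downsample by `down`.
    return resample_poly(waveform, up, down)
```

For example, `to_16k(samples, 44_100)` would resample with the ratio 160/441. When the audio comes from a 🤗 Datasets dataset, casting the column with `Audio(sampling_rate=16_000)` is an alternative that resamples on access.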
+
+ Paper: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870)
+
+ Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi
+
+ **Abstract**
+ This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
+
+ The original model can be found under https://github.com/asappresearch/sew#model-checkpoints .
+ # Usage
+ To transcribe audio files, the model can be used as a standalone acoustic model as follows:
+ ```python
+ from transformers import Wav2Vec2Processor, SEWDForCTC
+ from datasets import load_dataset
+ import torch
+
+ # load the model and preprocessor
+ processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h")
+ model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h")
+
+ # load the dummy dataset with speech samples
+ ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+
+ # preprocess the first sample, passing its sampling rate explicitly (the model expects 16 kHz)
+ audio = ds[0]["audio"]
+ input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values  # batch size 1
+
+ # retrieve logits (no gradients are needed for inference)
+ with torch.no_grad():
+     logits = model(input_values).logits
+
+ # take the argmax over the vocabulary and decode the ids to text
+ predicted_ids = torch.argmax(logits, dim=-1)
+ transcription = processor.batch_decode(predicted_ids)
+ ```
+
+ ## Evaluation
+
+ This code snippet shows how to evaluate **asapp/sew-d-mid-400k-ft-ls100h** on LibriSpeech's "clean" and "other" test data.
+
+ ```python
+ from datasets import load_dataset
+ from transformers import SEWDForCTC, Wav2Vec2Processor
+ import torch
+ from jiwer import wer
+
+ # use the "other" config instead of "clean" to evaluate on the "other" test set
+ librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
+
+ model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h").to("cuda")
+ processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h")
+
+ def map_to_pred(batch):
+     # batch_size=1 below, so each batch holds exactly one audio sample
+     input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
+                              return_tensors="pt", padding="longest").input_values
+     with torch.no_grad():
+         logits = model(input_values.to("cuda")).logits
+
+     predicted_ids = torch.argmax(logits, dim=-1)
+     transcription = processor.batch_decode(predicted_ids)
+     batch["transcription"] = transcription
+     return batch
+
+ result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
+
+ print("WER:", wer(result["text"], result["transcription"]))
+ ```
+
+ *Result (WER, in %)*:
+
+ | "clean" | "other" |
+ | --- | --- |
+ | 4.94 | 11.51 |
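For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is an illustrative pure-Python version for a single sentence pair. The function name `word_error_rate` is hypothetical, and this is not the `jiwer` implementation used in the evaluation snippet, which may apply its own text normalization and aggregates errors over the whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With the made-up pair `"the cat sat on the mat"` and `"the cat sit on mat"`, there is one substitution and one deletion against six reference words, so the function returns 2/6. Multiplying by 100 gives a percentage.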