Akshat committed on
Commit
0289adf
1 Parent(s): 6ed467b

Add SWRA model

README.md ADDED
@@ -0,0 +1,166 @@
+ ---
+ language: en
+ datasets:
+ - librispeech_asr
+ tags:
+ - speech
+ - audio
+ - automatic-speech-recognition
+ - hf-asr-leaderboard
+ license: mit
+ pipeline_tag: automatic-speech-recognition
+ widget:
+ - example_title: Librispeech sample 1
+   src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
+ - example_title: Librispeech sample 2
+   src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
+ model-index:
+ - name: SWRA (SWARA)
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: LibriSpeech (clean)
+       type: librispeech_asr
+       config: clean
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 4.3
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: LibriSpeech (other)
+       type: librispeech_asr
+       config: other
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 9.0
+ ---
+
+ # SWRA (SWARA)
+
+ `SWRA (SWARA)` is a Speech to Text Transformer (S2T) model trained by @binarybardakshat for automatic speech recognition (ASR). The S2T model was proposed in [this paper](https://arxiv.org/abs/2010.05171) and released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text).
+
+ ## Model Description
+
+ SWRA (SWARA) is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.
+
+ ## Intended Uses & Limitations
+
+ This model can be used for end-to-end automatic speech recognition (ASR). See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look for other S2T checkpoints.
+
+ ### How to Use
+
+ As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to generate the transcripts by passing the speech features to the model.
+
+ *Note: The `Speech2TextProcessor` feature extractor depends on [torchaudio](https://github.com/pytorch/audio) to extract the filter bank features, and the tokenizer depends on [sentencepiece](https://github.com/google/sentencepiece), so be sure to install both packages before running the examples.*
+
+ You can either install them as extra speech dependencies with `pip install "transformers[speech,sentencepiece]"` or install the packages separately with `pip install torchaudio sentencepiece`.
+
+ ```python
+ import torch
+ from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
+ from datasets import load_dataset
+
+ model = Speech2TextForConditionalGeneration.from_pretrained("binarybardakshat/swra-swara")
+ processor = Speech2TextProcessor.from_pretrained("binarybardakshat/swra-swara")
+
+ # Small dummy LibriSpeech split, used here only as a quick smoke test
+ ds = load_dataset(
+     "patrickvonplaten/librispeech_asr_dummy",
+     "clean",
+     split="validation"
+ )
+
+ # Extract filter bank features from the first 16 kHz utterance
+ input_features = processor(
+     ds[0]["audio"]["array"],
+     sampling_rate=16_000,
+     return_tensors="pt"
+ ).input_features  # Batch size 1
+ generated_ids = model.generate(input_features=input_features)
+
+ # Decode token IDs to text, dropping special tokens such as </s>
+ transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
+ ```
+
+ #### Evaluation on LibriSpeech Test
+
+ The following script shows how to evaluate this model on the [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)
+ *"clean"* and *"other"* test sets.
+
+ ```python
+ from datasets import load_dataset
+ from evaluate import load
+ from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor
+
+ librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for the other test set
+ wer = load("wer")
+
+ # Note: this loads the facebook/s2t-small-librispeech-asr checkpoint;
+ # use "binarybardakshat/swra-swara" to evaluate the weights in this repository.
+ model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
+ # do_upper_case=True uppercases the decoded text to match the LibriSpeech reference transcripts
+ processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)
+
+ def map_to_pred(batch):
+     # Extract features for one utterance and move them to the GPU
+     features = processor(batch["audio"]["array"], sampling_rate=16000, padding=True, return_tensors="pt")
+     input_features = features.input_features.to("cuda")
+     attention_mask = features.attention_mask.to("cuda")
+
+     gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
+     batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
+     return batch
+
+ result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
+
+ print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
+ ```
+
+ *Result (WER)*:
+
+ | "clean" | "other" |
+ |:-------:|:-------:|
+ | 4.3 | 9.0 |
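The WER figures above count word-level substitutions, deletions, and insertions against the reference transcripts. As an illustration only, the following pure-Python sketch computes that ratio with a word-level edit distance. The helpers `word_edit_distance` and `word_error_rate` are hypothetical names introduced here (the evaluation script itself relies on the `evaluate` package), and the sketch returns a fraction, so the table values are assumed to be percentages.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions, and insertions."""
    # previous[j] is the distance between the reference prefix processed so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words (a fraction, not a percentage)."""
    edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        edits += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return edits / total_words


# One dropped word out of six reference words
print(word_error_rate(["THE CAT SAT ON THE MAT"], ["THE CAT SAT ON MAT"]))
```

Summing edits over the whole test set before dividing, as done here, weights long utterances more heavily than averaging per-utterance rates would.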
+
+ ## Training data
+
+ S2T-SMALL-LIBRISPEECH-ASR is trained on the [LibriSpeech ASR Corpus](https://www.openslr.org/12), a dataset consisting of
+ approximately 1000 hours of 16kHz read English speech.
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ The speech data is pre-processed by extracting Kaldi-compliant 80-channel log mel-filter bank features automatically from
+ WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization)
+ is then applied to each example.
+
+ The texts are lowercased and tokenized using SentencePiece and a vocabulary size of 10,000.
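The utterance-level CMVN step described above can be illustrated with a small, dependency-free sketch. It is not the feature extractor's actual implementation: `utterance_cmvn` is a hypothetical helper, and the population standard deviation and the `eps` guard are assumptions made here. The `normalize_vars` flag mirrors the option of the same name in `preprocessor_config.json`.

```python
import math


def utterance_cmvn(features, normalize_vars=True, eps=1e-8):
    """Normalize each filter bank channel over all frames of one utterance.

    features: list of frames, each frame a list of per-channel floats.
    """
    num_frames = len(features)
    num_channels = len(features[0])

    # Per-channel mean over the time axis
    means = [sum(frame[c] for frame in features) / num_frames
             for c in range(num_channels)]
    centered = [[frame[c] - means[c] for c in range(num_channels)]
                for frame in features]
    if not normalize_vars:
        return centered

    # Per-channel (population) standard deviation over the time axis
    stds = [math.sqrt(sum(row[c] ** 2 for row in centered) / num_frames)
            for c in range(num_channels)]
    # eps (an assumption of this sketch) avoids division by zero on constant channels
    return [[row[c] / (stds[c] + eps) for c in range(num_channels)]
            for row in centered]


# Two frames with two channels each; real inputs would have 80 channels per frame
normalized = utterance_cmvn([[1.0, 10.0], [3.0, 30.0]])
print(normalized)
```

Because the statistics are computed per utterance, no global normalization file has to be shipped with the model.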
+
+ ### Training
+
+ The model is trained with standard autoregressive cross-entropy loss and [SpecAugment](https://arxiv.org/abs/1904.08779).
+ The encoder receives speech features, and the decoder generates the transcripts autoregressively.
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{wang2020fairseqs2t,
+   title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
+   author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
+   booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
+   year = {2020},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "_name_or_path": "hf_models_fb/s2t-small-librispeech-asr",
+   "activation_dropout": 0.1,
+   "activation_function": "relu",
+   "architectures": [
+     "Speech2TextForConditionalGeneration"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": 0.0,
+   "conv_channels": 1024,
+   "conv_kernel_sizes": [
+     5,
+     5
+   ],
+   "d_model": 256,
+   "decoder_attention_heads": 4,
+   "decoder_ffn_dim": 2048,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 6,
+   "decoder_start_token_id": 2,
+   "dropout": 0.1,
+   "early_stopping": true,
+   "encoder_attention_heads": 4,
+   "encoder_ffn_dim": 2048,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 12,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "init_std": 0.02,
+   "input_channels": 1,
+   "input_feat_per_channel": 80,
+   "is_encoder_decoder": true,
+   "max_length": 200,
+   "max_source_positions": 6000,
+   "max_target_positions": 1024,
+   "model_type": "speech_to_text",
+   "num_beams": 5,
+   "num_conv_layers": 2,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "scale_embedding": true,
+   "transformers_version": "4.4.0.dev0",
+   "use_cache": true,
+   "vocab_size": 10000
+ }
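The `num_conv_layers` and `conv_kernel_sizes` entries describe a convolutional front end that shortens the feature sequence before the transformer encoder. The sketch below estimates the resulting sequence length. It assumes each convolution uses stride 2 and padding of half the kernel size, which the config does not state, and `subsampled_length` is a hypothetical helper, not a library function.

```python
def subsampled_length(num_frames, num_conv_layers=2):
    """Frames left after the convolutional front end, assuming stride 2 per layer."""
    length = num_frames
    for _ in range(num_conv_layers):
        # With kernel 5, padding 2, and stride 2 (assumed), the output length is
        # floor((length + 2*2 - 5) / 2) + 1 = floor((length - 1) / 2) + 1
        length = (length - 1) // 2 + 1
    return length


# 1000 feature frames as an example input length
print(subsampled_length(1000))
```

Under these assumptions the two layers reduce the sequence length by roughly a factor of four, which keeps encoder self-attention affordable on long utterances.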
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "decoder_start_token_id": 2,
+   "early_stopping": true,
+   "eos_token_id": 2,
+   "max_length": 200,
+   "num_beams": 5,
+   "pad_token_id": 1,
+   "transformers_version": "4.27.0.dev0"
+ }
.gitattributes ADDED
@@ -0,0 +1,17 @@
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7d2b5fd0d9072cf00d3599363653a91f725d24e50a6b9ece8e4cb0837ba1969f
+ size 118185584
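The three lines above are a Git LFS pointer: the repository stores only the spec version, a SHA-256 object ID, and the byte size, while the weights themselves live in LFS storage. As a rough sketch, a downloaded file could be checked against such a pointer as follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this illustration and are not part of Git LFS or any Hugging Face library.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer):
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large weight files are not loaded into memory at once
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha256.update(chunk)
    return sha256.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:7d2b5fd0d9072cf00d3599363653a91f725d24e50a6b9ece8e4cb0837ba1969f\n"
    "size 118185584\n"
)
print(pointer["algorithm"], pointer["size"])
```

Calling `matches_pointer("model.safetensors", pointer)` on a local copy would then report whether the download is complete and uncorrupted.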
preprocessor_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "do_ceptral_normalize": true,
+   "feature_size": 80,
+   "normalize_means": true,
+   "normalize_vars": true,
+   "num_mel_bins": 80,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 16000
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95be85b800e626fa6063bf30bd40874b3a426fc12b0393b7046546e470fcc535
+ size 118267196
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:052a168787a9160b4b2ba54e4995e9600298812c34191ca3f70cea51cd4f5c1e
+ size 416684
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2faac8e98aaa3808196dab18955801120c7aab1c6d4d17ea788fefd1cd37aaa8
+ size 128800472
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "do_upper_case": false, "do_lower_case": true, "tgt_lang": null, "lang_codes": null, "special_tokens_map_file": "/home/suraj/.cache/huggingface/transformers/f39f1499e9c4d2b3e803e3cad8a31c4cf3e626e1c69197d4cd6921e5c07007f9.9d6cd81ef646692fb1c169a880161ea1cb95f49694f220aced9b704b457e51dd", "tokenizer_file": null, "name_or_path": "hf_models_fb/s2t-small-librispeech-asr"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff