---
language: sw
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - bookbot/ALFFA_swahili
  - bookbot/fleurs_sw
  - bookbot/common_voice_16_1_sw
---

# Pruned Stateless Zipformer RNN-T Streaming Robust SW

Pruned Stateless Zipformer RNN-T Streaming Robust SW is an automatic speech recognition model trained on the following datasets:

- [ALFFA Swahili](https://huggingface.co/datasets/bookbot/ALFFA_swahili)
- [FLEURS Swahili](https://huggingface.co/datasets/bookbot/fleurs_sw)
- [Common Voice 16.1 Swahili](https://huggingface.co/datasets/bookbot/common_voice_16_1_sw)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `["w", "ɑ", "ʃ", "i", "ɑ"]`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [gruut](https://github.com/rhasspy/gruut).

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Scaleway RENDER-S VM with an NVIDIA H100 GPU. All necessary scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/zipformer-streaming-robust-sw/tensorboard) logged via TensorBoard.
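
Because the model emits phoneme sequences rather than words, its accuracy is reported as a phoneme error rate (PER). The sketch below shows one common way to compute such a rate: the Levenshtein edit distance between reference and hypothesis phoneme lists, divided by the total number of reference phonemes. It is an illustrative assumption, not the icefall scoring code, and the function names `edit_distance` and `phoneme_error_rate` and the sample hypothesis are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the first i-1 reference
    # phonemes and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: List[Sequence[str]], hyps: List[Sequence[str]]
) -> float:
    """Total edit distance over total reference length, in percent."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference is empty")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / total


# Reference taken from the example above; the hypothesis is made up
# and contains a single substitution (ʃ -> s).
ref = ["w", "ɑ", "ʃ", "i", "ɑ"]
hyp = ["w", "ɑ", "s", "i", "ɑ"]
print(phoneme_error_rate([ref], [hyp]))  # 20.0
```

Scoring over phoneme tokens rather than characters matters here, since a single phoneme in the vocabulary may span more than one Unicode code point.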

## Evaluation Results

### Simulated Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m
done
```

```sh
./zipformer/ctc_decode.py \
  --epoch 40 \
  --avg 7 \
  --causal 1 \
  --chunk-size 32 \
  --left-context-frames 128 \
  --exp-dir zipformer/exp-causal \
  --decoding-method ctc-decoding \
  --use-transducer True --use-ctc True
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.71        |  6.58  |
| Modified Beam Search |       7.53        |  6.4   |
| Fast Beam Search     |       7.73        |  6.61  |
| CTC Greedy Search    |       7.78        |  6.72  |

### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 40 \
    --avg 7 \
    --causal 1 \
    --chunk-size 32 \
    --left-context-frames 128 \
    --exp-dir zipformer/exp-causal \
    --use-transducer True --use-ctc True \
    --decoding-method $m \
    --num-decode-streams 1000
done
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | Common Voice 16.1 | FLEURS |
| -------------------- | :---------------: | :----: |
| Greedy Search        |       7.75        |  6.59  |
| Modified Beam Search |       7.57        |  6.37  |
| Fast Beam Search     |       7.72        |  6.44  |

## Usage

### Download Pre-trained Model

```sh
cd egs/bookbot_sw/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-sw/
```

### Inference

To decode with greedy search, run:

```sh
./zipformer/jit_pretrained_streaming.py \
  --nn-model-filename ./tmp/zipformer-streaming-robust-sw/exp-causal/jit_script_chunk_32_left_128.pt \
  --tokens ./tmp/zipformer-streaming-robust-sw/data/lang_phone/tokens.txt \
  ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
```

Decoding Output

```
2024-03-07 11:07:41,231 INFO [jit_pretrained_streaming.py:184] device: cuda:0
2024-03-07 11:07:41,865 INFO [jit_pretrained_streaming.py:197] Constructing Fbank computer
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:200] Reading sound files: ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:205] torch.Size([125568])
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:207] Decoding started
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:212] chunk_length: 64
2024-03-07 11:07:41,866 INFO [jit_pretrained_streaming.py:213] T: 77
2024-03-07 11:07:41,876 INFO [jit_pretrained_streaming.py:229] 0/130368
2024-03-07 11:07:41,877 INFO [jit_pretrained_streaming.py:229] 4000/130368
2024-03-07 11:07:41,878 INFO [jit_pretrained_streaming.py:229] 8000/130368
2024-03-07 11:07:41,879 INFO [jit_pretrained_streaming.py:229] 12000/130368
2024-03-07 11:07:42,103 INFO [jit_pretrained_streaming.py:229] 16000/130368
2024-03-07 11:07:42,104 INFO [jit_pretrained_streaming.py:229] 20000/130368
2024-03-07 11:07:42,126 INFO [jit_pretrained_streaming.py:229] 24000/130368
2024-03-07 11:07:42,127 INFO [jit_pretrained_streaming.py:229] 28000/130368
2024-03-07 11:07:42,128 INFO [jit_pretrained_streaming.py:229] 32000/130368
2024-03-07 11:07:42,151 INFO [jit_pretrained_streaming.py:229] 36000/130368
2024-03-07 11:07:42,152 INFO [jit_pretrained_streaming.py:229] 40000/130368
2024-03-07 11:07:42,175 INFO [jit_pretrained_streaming.py:229] 44000/130368
2024-03-07 11:07:42,176 INFO [jit_pretrained_streaming.py:229] 48000/130368
2024-03-07 11:07:42,177 INFO [jit_pretrained_streaming.py:229] 52000/130368
2024-03-07 11:07:42,200 INFO [jit_pretrained_streaming.py:229] 56000/130368
2024-03-07 11:07:42,201 INFO [jit_pretrained_streaming.py:229] 60000/130368
2024-03-07 11:07:42,224 INFO [jit_pretrained_streaming.py:229] 64000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 68000/130368
2024-03-07 11:07:42,226 INFO [jit_pretrained_streaming.py:229] 72000/130368
2024-03-07 11:07:42,250 INFO [jit_pretrained_streaming.py:229] 76000/130368
2024-03-07 11:07:42,251 INFO [jit_pretrained_streaming.py:229] 80000/130368
2024-03-07 11:07:42,252 INFO [jit_pretrained_streaming.py:229] 84000/130368
2024-03-07 11:07:42,275 INFO [jit_pretrained_streaming.py:229] 88000/130368
2024-03-07 11:07:42,276 INFO [jit_pretrained_streaming.py:229] 92000/130368
2024-03-07 11:07:42,299 INFO [jit_pretrained_streaming.py:229] 96000/130368
2024-03-07 11:07:42,300 INFO [jit_pretrained_streaming.py:229] 100000/130368
2024-03-07 11:07:42,301 INFO [jit_pretrained_streaming.py:229] 104000/130368
2024-03-07 11:07:42,325 INFO [jit_pretrained_streaming.py:229] 108000/130368
2024-03-07 11:07:42,326 INFO [jit_pretrained_streaming.py:229] 112000/130368
2024-03-07 11:07:42,349 INFO [jit_pretrained_streaming.py:229] 116000/130368
2024-03-07 11:07:42,350 INFO [jit_pretrained_streaming.py:229] 120000/130368
2024-03-07 11:07:42,351 INFO [jit_pretrained_streaming.py:229] 124000/130368
2024-03-07 11:07:42,373 INFO [jit_pretrained_streaming.py:229] 128000/130368
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:259] ./tmp/zipformer-streaming-robust-sw/test_waves/sample1.wav
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:260] ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ
2024-03-07 11:07:42,374 INFO [jit_pretrained_streaming.py:262] Decoding Done
```
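
The decoded result is a single string of IPA symbols in which `|` appears to separate words (an assumption based on the output above, not something the scripts document here). A minimal sketch for splitting such a string into per-word phoneme strings follows; the helper name `split_words` is hypothetical.

```python
from typing import List


def split_words(decoded: str, sep: str = "|") -> List[str]:
    """Split a decoded phoneme string on the assumed word separator."""
    # Empty pieces (e.g. from a leading or trailing separator) are dropped.
    return [word for word in decoded.strip().split(sep) if word]


# Decoded string copied from the log output above.
decoded = (
    "ʃiɑ|ɑᵐɓɑɔ|wɑnɑiʃi|hɑsɑ|kɑtikɑ|ɛnɛɔ|lɑ|mɑʃɑɾiki|"
    "kɑtikɑ|ufɑlmɛ|huɔ|wɛnjɛ|utɑʄiɾi|wɑ|mɑfutɑ"
)
words = split_words(decoded)
print(len(words))  # 15
print(words[:3])   # ['ʃiɑ', 'ɑᵐɓɑɔ', 'wɑnɑiʃi']
```

Splitting each word further into individual phonemes is best done against the entries in `tokens.txt` rather than character by character, because a phoneme may consist of more than one Unicode code point.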

## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
```

### Prepare Data

```sh
cd egs/bookbot_sw/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0"
./zipformer/train.py \
  --num-epochs 40 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --max-duration 800 \
  --use-transducer True --use-ctc True
```

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)