	---
	language: id
	license: apache-2.0
	tags:
	- icefall
	- phoneme-recognition
	- automatic-speech-recognition
	datasets:
	- mozilla-foundation/common_voice_13_0
	- indonesian-nlp/librivox-indonesia
	- google/fleurs
	---

	# Pruned Stateless Zipformer RNN-T Streaming ID

	Pruned Stateless Zipformer RNN-T Streaming ID is an automatic speech recognition model trained on the following datasets:

	- [Common Voice ID](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)
	- [LibriVox Indonesia](https://huggingface.co/datasets/indonesian-nlp/librivox-indonesia)
	- [FLEURS ID](https://huggingface.co/datasets/google/fleurs)

	Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `['p', 'ə', 'r', 'b', 'u', 'a', 't', 'a', 'n', 'ɲ', 'a']`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [g2p ID](https://github.com/bookbot-kids/g2p_id).

	This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Google Cloud Engine VM with a Tesla A100 GPU. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id/tree/main) tab, and the [Training metrics](https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id/tensorboard) were logged via TensorBoard.

	## Evaluation Results

	### Simulated Streaming

	```sh
	for m in greedy_search fast_beam_search modified_beam_search; do
	  ./pruned_transducer_stateless7_streaming/decode.py \
	    --epoch 30 \
	    --avg 9 \
	    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
	    --max-duration 600 \
	    --decode-chunk-len 32 \
	    --decoding-method "$m"
	done
	```

	The model achieves the following phoneme error rates on the different test sets:

	| Decoding             | LibriVox | FLEURS | Common Voice |
	| -------------------- | :------: | :----: | :----------: |
	| Greedy Search        |  4.87%   | 11.45% |    14.97%    |
	| Modified Beam Search |  4.71%   | 11.25% |    14.31%    |
	| Fast Beam Search     |  4.85%   | 12.55% |    14.89%    |
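
For reference, phoneme error rate is conventionally computed as the Levenshtein (edit) distance between the reference and hypothesis phoneme sequences, divided by the number of reference phonemes. The sketch below illustrates that definition with a hypothetical `per` helper. It is not the scoring code used by `decode.py`, and the example strings are made up for illustration.

```sh
# per REF HYP: phoneme error rate (%) between two space-separated phoneme strings.
# Hypothetical helper for illustration only; not part of icefall.
per() {
  awk -v ref="$1" -v hyp="$2" 'BEGIN {
    n = split(ref, r, " ")   # reference phonemes
    m = split(hyp, h, " ")   # hypothesis phonemes
    if (n == 0) { print "reference is empty"; exit 1 }
    # d[i, j] = edit distance between the first i reference
    # phonemes and the first j hypothesis phonemes
    for (i = 0; i <= n; i++) d[i, 0] = i
    for (j = 0; j <= m; j++) d[0, j] = j
    for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++) {
        cost = (r[i] == h[j]) ? 0 : 1
        best = d[i-1, j-1] + cost                        # match or substitution
        if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1   # deletion
        if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1   # insertion
        d[i, j] = best
      }
    printf "%.2f\n", 100 * d[n, m] / n
  }'
}

# One substitution out of five reference phonemes.
per 'p u l a ŋ' 'p u l a n'   # prints 20.00
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 100%.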

	### Chunk-wise Streaming

	```sh
	for m in greedy_search fast_beam_search modified_beam_search; do
	  ./pruned_transducer_stateless7_streaming/streaming_decode.py \
	    --epoch 30 \
	    --avg 9 \
	    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
	    --decoding-method "$m" \
	    --decode-chunk-len 32 \
	    --num-decode-streams 1500
	done
	```

	The model achieves the following phoneme error rates on the different test sets:

	| Decoding             | LibriVox | FLEURS | Common Voice |
	| -------------------- | :------: | :----: | :----------: |
	| Greedy Search        |  5.12%   | 12.74% |    15.78%    |
	| Modified Beam Search |  4.78%   | 11.83% |    14.54%    |
	| Fast Beam Search     |  4.81%   | 12.93% |    14.96%    |

	## Usage

	### Download Pre-trained Model

	```sh
	cd egs/bookbot/ASR
	mkdir tmp
	cd tmp
	git lfs install
	git clone https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id
	```
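
Once the repository is cloned, the phoneme vocabulary can be inspected directly. The sketch below assumes `tokens.txt` holds one `symbol id` pair per line and that special entries start with `<` or `#`. Both are assumptions about the file layout, and `list_phonemes` is a hypothetical helper rather than part of icefall.

```sh
# list_phonemes FILE: print the symbols of a tokens file, skipping special entries.
# Assumes one "symbol id" pair per line, with special symbols starting with "<" or "#".
list_phonemes() {
  awk '$1 !~ /^[<#]/ { print $1 }' "$1"
}

# Tiny made-up sample so the snippet runs without the cloned repository.
sample=$(mktemp)
printf '%s\n' '<eps> 0' 'a 1' 'ə 2' 'ŋ 3' '#0 4' > "$sample"
list_phonemes "$sample"
rm -f "$sample"
```

With the clone above in place, the same helper could be pointed at `./tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone/tokens.txt` (relative to the recipe directory).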

	### Inference

	To decode with greedy search, run:

	```sh
	./pruned_transducer_stateless7_streaming/jit_pretrained.py \
	  --nn-model-filename ./tmp/pruned-transducer-stateless7-streaming-id/exp/cpu_jit.pt \
	  --lang-dir ./tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone \
	  ./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav
	```

	<details>
	<summary>Decoding Output</summary>

	```
	2023-06-21 10:19:18,563 INFO [jit_pretrained.py:217] device: cpu
	2023-06-21 10:19:19,231 INFO [lexicon.py:168] Loading pre-compiled tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone/Linv.pt
	2023-06-21 10:19:19,232 INFO [jit_pretrained.py:228] Constructing Fbank computer
	2023-06-21 10:19:19,233 INFO [jit_pretrained.py:238] Reading sound files: ['./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav']
	2023-06-21 10:19:19,234 INFO [jit_pretrained.py:244] Decoding started
	2023-06-21 10:19:20,090 INFO [jit_pretrained.py:271]
	./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav:
	p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r


	2023-06-21 10:19:20,090 INFO [jit_pretrained.py:273] Decoding Done
	```

	</details>
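
The decoded line is a space-separated phoneme sequence in which `|` appears to mark word boundaries. Under that assumption, a small post-processing step can join the phonemes into word-level strings. `phonemes_to_words` below is a hypothetical helper, not part of icefall.

```sh
# phonemes_to_words: read decoded lines on stdin, join the phonemes of each
# word, and turn every "|" word boundary into a single space.
phonemes_to_words() {
  sed -e 's/ //g' -e 's/|/ /g'
}

printf '%s\n' 'p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r' \
  | phonemes_to_words
# prints: pulaŋ səkolah pitəri saŋat lapar
```

The result is still a phonemic transcription. Mapping it back to standard orthography would need a separate phoneme-to-grapheme step.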

	## Training procedure

	### Install icefall

	```sh
	git clone https://github.com/bookbot-hive/icefall
	cd icefall
	export PYTHONPATH="$(pwd):$PYTHONPATH"
	```

	### Prepare Data

	```sh
	cd egs/bookbot_id/ASR
	./prepare.sh
	```

	### Train

	```sh
	export CUDA_VISIBLE_DEVICES="0"
	./pruned_transducer_stateless7_streaming/train.py \
	  --num-epochs 30 \
	  --use-fp16 1 \
	  --max-duration 400
	```

	## Frameworks

	- [k2](https://github.com/k2-fsa/k2)
	- [icefall](https://github.com/bookbot-hive/icefall)
	- [lhotse](https://github.com/bookbot-hive/lhotse)