---
language: id
license: apache-2.0
tags:
  - icefall
  - phoneme-recognition
  - automatic-speech-recognition
datasets:
  - mozilla-foundation/common_voice_13_0
  - indonesian-nlp/librivox-indonesia
  - google/fleurs
---

# Pruned Stateless Zipformer RNN-T Streaming ID

Pruned Stateless Zipformer RNN-T Streaming ID is an automatic speech recognition model trained on the following datasets:

- [Common Voice ID](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)
- [LibriVox Indonesia](https://huggingface.co/datasets/indonesian-nlp/librivox-indonesia)
- [FLEURS ID](https://huggingface.co/datasets/google/fleurs)

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. `['p', 'ə', 'r', 'b', 'u', 'a', 't', 'a', 'n', 'ɲ', 'a']`. Therefore, the model's [vocabulary](https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id/blob/main/data/lang_phone/tokens.txt) contains the different IPA phonemes found in [g2p ID](https://github.com/bookbot-kids/g2p_id).
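
As a rough illustration of what a phoneme-level vocabulary means in practice, the sketch below parses a small token table and encodes the example phoneme sequence as integer IDs. It assumes the one-`symbol id`-pair-per-line layout used by icefall token tables. The table contents are a made-up excerpt rather than the model's actual vocabulary, and `load_token_table` and `encode` are hypothetical helper names.

```python
from typing import Dict, List

# Hypothetical excerpt in "symbol id" layout; the real tokens.txt differs.
SAMPLE_TOKENS = """\
<blk> 0
p 1
ə 2
r 3
b 4
u 5
a 6
t 7
n 8
ɲ 9
"""


def load_token_table(text: str) -> Dict[str, int]:
    """Map each symbol to its integer ID, skipping blank lines."""
    table: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        symbol, idx = line.split()
        table[symbol] = int(idx)
    return table


def encode(phonemes: List[str], table: Dict[str, int]) -> List[int]:
    """Convert a phoneme sequence into the IDs a model would predict."""
    return [table[p] for p in phonemes]


token_table = load_token_table(SAMPLE_TOKENS)
phonemes = ['p', 'ə', 'r', 'b', 'u', 'a', 't', 'a', 'n', 'ɲ', 'a']
ids = encode(phonemes, token_table)
print(ids)
```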

This model was trained using the [icefall](https://github.com/k2-fsa/icefall) framework. All training was done on a Scaleway RENDER-S VM with a Tesla P100 GPU. All scripts used for training can be found in the [Files and versions](https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id/tree/main) tab, along with the [Training metrics](https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id/tensorboard) logged via TensorBoard.

## Evaluation Results

### Simulated Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming/decode.py \
    --epoch 30 \
    --avg 9 \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --max-duration 600 \
    --decode-chunk-len 32 \
    --decoding-method "$m"
done
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | LibriVox | FLEURS | Common Voice |
| -------------------- | :------: | :----: | :----------: |
| Greedy Search        |  4.87%   | 11.45% |    14.97%    |
| Modified Beam Search |  4.71%   | 11.25% |    14.31%    |
| Fast Beam Search     |  4.85%   | 12.55% |    14.89%    |
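
For readers unfamiliar with the metric, phoneme error rate is conventionally the total edit distance (substitutions, insertions, deletions) between reference and hypothesis phoneme sequences, divided by the total number of reference phonemes. The sketch below implements that standard definition. It is not icefall's scoring code, which may differ in details, and the reference/hypothesis pairs are invented for illustration.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Total edit errors divided by total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Invented example pairs: one substitution and one deletion over 10 phonemes.
refs = ["p u l a ŋ".split(), "l a p a r".split()]
hyps = ["p u l a n".split(), "l a p a".split()]
per = phoneme_error_rate(refs, hyps)
print(f"PER: {per:.2%}")
```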

### Chunk-wise Streaming

```sh
for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming/streaming_decode.py \
    --epoch 30 \
    --avg 9 \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --decoding-method "$m" \
    --decode-chunk-len 32 \
    --num-decode-streams 1500
done
```

The model achieves the following phoneme error rates on the different test sets:

| Decoding             | LibriVox | FLEURS | Common Voice |
| -------------------- | :------: | :----: | :----------: |
| Greedy Search        |  5.12%   | 12.74% |    15.78%    |
| Modified Beam Search |  4.78%   | 11.83% |    14.54%    |
| Fast Beam Search     |  4.81%   | 12.93% |    14.96%    |
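
To compare the two evaluation modes, the sketch below takes the figures from the simulated streaming and chunk-wise streaming tables and computes the difference in percentage points per decoding method and test set. Only the numbers come from this model card; the dictionary layout and variable names are illustrative.

```python
TEST_SETS = ("LibriVox", "FLEURS", "Common Voice")

# Phoneme error rates (%) copied from the two tables in this section.
simulated = {
    "Greedy Search": (4.87, 11.45, 14.97),
    "Modified Beam Search": (4.71, 11.25, 14.31),
    "Fast Beam Search": (4.85, 12.55, 14.89),
}
chunkwise = {
    "Greedy Search": (5.12, 12.74, 15.78),
    "Modified Beam Search": (4.78, 11.83, 14.54),
    "Fast Beam Search": (4.81, 12.93, 14.96),
}

# Positive values mean chunk-wise streaming has the higher error rate.
gaps = {
    method: [round(c - s, 2) for s, c in zip(simulated[method], chunkwise[method])]
    for method in simulated
}

for method, diffs in gaps.items():
    row = ", ".join(f"{name}: {d:+.2f}" for name, d in zip(TEST_SETS, diffs))
    print(f"{method}: {row}")
```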

## Usage

### Download Pre-trained Model

```sh
cd egs/bookbot/ASR
mkdir -p tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id
cd ..  # return to the recipe directory; the inference command uses ./tmp/... paths

### Inference

To decode with greedy search, run:

```sh
./pruned_transducer_stateless7_streaming/jit_pretrained.py \
  --nn-model-filename ./tmp/pruned-transducer-stateless7-streaming-id/exp/cpu_jit.pt \
  --lang-dir ./tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone \
  ./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav
```

<details>
<summary>Decoding Output</summary>

```
2023-06-21 10:19:18,563 INFO [jit_pretrained.py:217] device: cpu
2023-06-21 10:19:19,231 INFO [lexicon.py:168] Loading pre-compiled tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone/Linv.pt
2023-06-21 10:19:19,232 INFO [jit_pretrained.py:228] Constructing Fbank computer
2023-06-21 10:19:19,233 INFO [jit_pretrained.py:238] Reading sound files: ['./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav']
2023-06-21 10:19:19,234 INFO [jit_pretrained.py:244] Decoding started
2023-06-21 10:19:20,090 INFO [jit_pretrained.py:271] 
./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav:
p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r


2023-06-21 10:19:20,090 INFO [jit_pretrained.py:273] Decoding Done
```

</details>
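
The decoded line is a space-separated phoneme string. Assuming `|` marks word boundaries, as the sample output suggests, it can be split into per-word phoneme lists with a few lines of Python. `split_words` is a hypothetical helper, not part of icefall.

```python
from typing import List


def split_words(decoded: str, boundary: str = "|") -> List[List[str]]:
    """Split a decoded phoneme string into one phoneme list per word."""
    words = []
    for chunk in decoded.split(boundary):
        phonemes = chunk.split()
        if phonemes:  # ignore empty chunks from leading or trailing boundaries
            words.append(phonemes)
    return words


decoded = "p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r"
words = split_words(decoded)
joined = ["".join(w) for w in words]
print(joined)
```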

## Training procedure

### Install icefall

```sh
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH="$(pwd):$PYTHONPATH"
```

### Prepare Data

```sh
cd egs/bookbot_id/ASR
./prepare.sh
```

### Train

```sh
export CUDA_VISIBLE_DEVICES="0"
./pruned_transducer_stateless7_streaming/train.py \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 400
```

## Frameworks

- [k2](https://github.com/k2-fsa/k2)
- [icefall](https://github.com/bookbot-hive/icefall)
- [lhotse](https://github.com/bookbot-hive/lhotse)