---
language:
- en
library_name: nemo
datasets:
- SLURP
thumbnail: null
tags:
- spoken-language-understanding
- speech-intent-classification
- speech-slot-filling
- SLURP
- Conformer
- Transformer
- pytorch
- NeMo
license: cc-by-4.0
model-index:
- name: slu_conformer_transformer_large_slurp
  results:
  - task:
      name: Spoken Language Understanding
      type: spoken-language-understanding
    dataset:
      name: SLURP
      type: spoken-language-understanding
      split: test
    metrics:
    - name: Intent Accuracy
      type: acc
      value: 90.14
    - name: SLURP Precision
      type: precision
      value: 84.31
    - name: SLURP Recall
      type: recall
      value: 80.33
    - name: SLURP F1
      type: f1
      value: 82.27
---

# NeMo End-to-End Speech Intent Classification and Slot Filling

## Model Overview

This model performs joint intent classification and slot filling directly from audio input. It treats the problem as an audio-to-text task, where the output text is a flattened string representation of the semantic annotation. The model is trained on the SLURP dataset [1].
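
To make the target format concrete, the sketch below shows a hypothetical flattened annotation for an utterance like "wake me up at five am". It follows the SLURP-style scenario/action/entities structure, but this specific example is illustrative and not taken from the dataset.

```python
# Hypothetical example of a flattened semantics target (illustrative only).
# The model maps audio directly to a string like this: the scenario/action
# pair forms the intent, and the entities carry the slot fillers.
utterance = "wake me up at five am"
target = "{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'five am'}]}"
```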

## Model Architecture

The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2] and the decoder is a three-layer Transformer decoder [3]. We use the Conformer encoder pretrained on NeMo ASR-Set (details [here](https://ngc.nvidia.com/models/nvidia:nemo:stt_en_conformer_ctc_large)), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with label smoothing and teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
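
As a rough illustration of how BOS triggers generation and EOS terminates it, here is a minimal greedy decoding sketch; it is not the model's actual beam-search implementation, and `decoder`, `encoder_states`, and the token IDs are placeholders, not NeMo APIs.

```python
import torch

# Minimal greedy decoding sketch: generation starts from BOS and stops at EOS.
# `decoder`, `encoder_states`, `bos_id`, and `eos_id` are placeholders.
def greedy_decode(decoder, encoder_states, bos_id, eos_id, max_len=64):
    tokens = [bos_id]  # a BOS token triggers the generation process
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), encoder_states)  # next-token scores
        next_token = int(logits[0, -1].argmax())
        if next_token == eos_id:  # EOS terminates the sequence
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the BOS token
```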

## Training

The NeMo toolkit [4] was used to train the models for around 100 epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/run_slurp_train.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/configs/conformer_transformer_large_bpe.yaml).

The tokenizers for these models were built using the semantic annotations of the training set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). We use a vocabulary size of 58, including the BOS, EOS, and padding tokens.
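
For intuition, a BPE tokenizer of the same size could also be trained directly with the SentencePiece library; the sketch below assumes the flattened annotation strings have been dumped to a plain-text file (`annotations.txt` is a hypothetical name), while the NeMo script above remains the documented path.

```python
import sentencepiece as spm

# Sketch: train a small BPE tokenizer over the flattened annotation strings.
# "annotations.txt" is an assumed dump of the training-set targets, one per line.
spm.SentencePieceTrainer.train(
    input="annotations.txt",
    model_prefix="slurp_bpe",
    vocab_size=58,        # matches the model card, including special tokens
    model_type="bpe",
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,  # reserve padding/BOS/EOS/UNK IDs
)
```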

### Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

## Performance

|             |                             |                |                  | **Intent (Scenario_Action)** | **Entity**    |            |        | **SLURP Metrics** |            |        |
|-------------|-----------------------------|----------------|------------------|------------------------------|---------------|------------|--------|-------------------|------------|--------|
| **Version** | **Model**                   | **Params (M)** | **Pretrained**   | **Accuracy**                 | **Precision** | **Recall** | **F1** | **Precision**     | **Recall** | **F1** |
| 1.13.0      | Conformer-Transformer-Large | 127            | NeMo ASR-Set 3.0 | 90.14                        | 78.95         | 74.93      | 76.89  | 84.31             | 80.33      | 82.27  |
| Baseline    | Conformer-Transformer-Large | 127            | None             | 72.56                        | 43.19         | 43.50      | 43.34  | 53.59             | 53.92      | 53.76  |

Note: during inference, we use a beam size of 32 and a temperature of 1.25.

## How to Use this Model

The model is available for use in the NeMo toolkit [4], and can be used on another dataset with the same annotation format.

### Automatically load the model from NGC

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
```
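
Once loaded, the model can be run on audio files directly. The call below assumes `SLUIntentSlotBPEModel` exposes NeMo's usual `transcribe()` interface (argument names vary across NeMo versions), so treat it as a sketch and prefer the inference script below for evaluation.

```python
# Sketch: run the loaded model on a local WAV file (16 kHz, mono).
# Assumes the standard NeMo transcribe() interface; check your NeMo version.
predictions = asr_model.transcribe(["sample.wav"])
print(predictions[0])  # flattened intent/slot annotation string
```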

### Predict intents and slots with this model

```shell
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
    pretrained_name="slu_conformer_transformer_large_slurp" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
    sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
    sequence_generator.beam_size="<SIZE OF BEAM>" \
    sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
```

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
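
If your recordings have a different sample rate or multiple channels, they can be converted first. Here is a small sketch using the librosa and soundfile libraries (assumed to be installed; they are not part of NeMo itself):

```python
import librosa
import soundfile as sf

# Sketch: convert an arbitrary audio file to 16 kHz mono WAV for the model.
audio, _ = librosa.load("input.mp3", sr=16000, mono=True)  # resample + downmix
sf.write("sample.wav", audio, 16000)
```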

### Output

This model provides the intent and slot annotations as a string for a given audio sample.

## Limitations

Since this model was trained only on the SLURP dataset [1], its performance might degrade on other datasets.

## References

[1] [SLURP: A Spoken Language Understanding Resource Package](https://arxiv.org/abs/2011.13205)

[2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## License

License to use this model is covered by the NGC [TERMS OF USE](https://ngc.nvidia.com/legal/terms) unless another License/Terms of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC [TERMS OF USE](https://ngc.nvidia.com/legal/terms).