---
language:
- en
library_name: nemo
datasets:
- SLURP
thumbnail: null
tags:
- spoken-language-understanding
- speech-intent-classification
- speech-slot-filling
- SLURP
- Conformer
- Transformer
- pytorch
- NeMo
license: cc-by-4.0
model-index:
- name: slu_conformer_transformer_large_slurp
  results:
  - task:
      name: Spoken Language Understanding
      type: spoken-language-understanding
    dataset:
      name: SLURP
      type: spoken-language-understanding
      split: test
    metrics:
    - name: Intent Accuracy
      type: acc
      value: 90.14
    - name: SLURP Precision
      type: precision
      value: 84.31
    - name: SLURP Recall
      type: recall
      value: 80.33
    - name: SLURP F1
      type: f1
      value: 82.27
---

# NeMo End-to-End Speech Intent Classification and Slot Filling

## Model Overview

This model performs joint intent classification and slot filling directly from audio input. It treats the problem as an audio-to-text task, where the output text is a flattened string representation of the semantic annotation. The model is trained on the SLURP dataset [1].
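
To make the target format concrete, the sketch below shows a hypothetical flattened annotation for an utterance like "wake me up at five am". It follows the SLURP-style scenario/action/entities structure, but this specific example is illustrative and not taken from the dataset.

```python
# Hypothetical example of a flattened semantics target (illustrative only).
# The model maps audio directly to a string like this: the scenario/action
# pair forms the intent, and the entities carry the slot fillers.
utterance = "wake me up at five am"
target = "{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'five am'}]}"
```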

## Model Architecture

The model has an encoder-decoder architecture, where the encoder is a Conformer-Large model [2] and the decoder is a three-layer Transformer decoder [3]. We use the Conformer encoder pretrained on NeMo ASR-Set (details [here](https://ngc.nvidia.com/models/nvidia:nemo:stt_en_conformer_ctc_large)), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with label smoothing and teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
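
As a rough illustration of how BOS triggers generation and EOS terminates it, here is a minimal greedy decoding sketch; it is not the model's actual beam-search implementation, and `decoder`, `encoder_states`, and the token IDs are placeholders, not NeMo APIs.

```python
import torch

# Minimal greedy decoding sketch: generation starts from BOS and stops at EOS.
# `decoder`, `encoder_states`, `bos_id`, and `eos_id` are placeholders.
def greedy_decode(decoder, encoder_states, bos_id, eos_id, max_len=64):
    tokens = [bos_id]  # a BOS token triggers the generation process
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), encoder_states)  # next-token scores
        next_token = int(logits[0, -1].argmax())
        if next_token == eos_id:  # EOS terminates the sequence
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the BOS token
```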

## Training

The NeMo toolkit [4] was used to train the models for around 100 epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/run_slurp_train.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/configs/conformer_transformer_large_bpe.yaml).

The tokenizers for these models were built using the semantic annotations of the training set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). We use a vocabulary size of 58, including the BOS, EOS, and padding tokens.
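
For intuition, a BPE tokenizer of the same size could also be trained directly with the SentencePiece library; the sketch below assumes the flattened annotation strings have been dumped to a plain-text file (`annotations.txt` is a hypothetical name), while the NeMo script above remains the documented path.

```python
import sentencepiece as spm

# Sketch: train a small BPE tokenizer over the flattened annotation strings.
# "annotations.txt" is an assumed dump of the training-set targets, one per line.
spm.SentencePieceTrainer.train(
    input="annotations.txt",
    model_prefix="slurp_bpe",
    vocab_size=58,        # matches the model card, including special tokens
    model_type="bpe",
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,  # reserve padding/BOS/EOS/UNK IDs
)
```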

### Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

## Performance

|             |                             |                |                  | **Intent (Scenario_Action)** | **Entity**    |            |        | **SLURP Metrics** |            |        |
|-------------|-----------------------------|----------------|------------------|------------------------------|---------------|------------|--------|-------------------|------------|--------|
| **Version** | **Model**                   | **Params (M)** | **Pretrained**   | **Accuracy**                 | **Precision** | **Recall** | **F1** | **Precision**     | **Recall** | **F1** |
| 1.13.0      | Conformer-Transformer-Large | 127            | NeMo ASR-Set 3.0 | 90.14                        | 78.95         | 74.93      | 76.89  | 84.31             | 80.33      | 82.27  |
| Baseline    | Conformer-Transformer-Large | 127            | None             | 72.56                        | 43.19         | 43.50      | 43.34  | 53.59             | 53.92      | 53.76  |

Note: during inference, we use a beam size of 32 and a temperature of 1.25.

## How to Use this Model

The model is available for use in the NeMo toolkit [4], and can be used on another dataset with the same annotation format.

### Automatically load the model from NGC

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
```
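
Once loaded, the model can be run on audio files directly. The call below assumes `SLUIntentSlotBPEModel` exposes NeMo's usual `transcribe()` interface (argument names vary across NeMo versions), so treat it as a sketch and prefer the inference script below for evaluation.

```python
# Sketch: run the loaded model on a local WAV file (16 kHz, mono).
# Assumes the standard NeMo transcribe() interface; check your NeMo version.
predictions = asr_model.transcribe(["sample.wav"])
print(predictions[0])  # flattened intent/slot annotation string
```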

### Predict intents and slots with this model

```shell
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
    pretrained_name="slu_conformer_transformer_large_slurp" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
    sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
    sequence_generator.beam_size="<SIZE OF BEAM>" \
    sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
```

### Input

This model accepts 16 kHz mono-channel audio (WAV files) as input.
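
If your recordings have a different sample rate or multiple channels, they can be converted first. Here is a small sketch using the librosa and soundfile libraries (assumed to be installed; they are not part of NeMo itself):

```python
import librosa
import soundfile as sf

# Sketch: convert an arbitrary audio file to 16 kHz mono WAV for the model.
audio, _ = librosa.load("input.mp3", sr=16000, mono=True)  # resample + downmix
sf.write("sample.wav", audio, 16000)
```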

### Output

This model provides the intent and slot annotations as a string for a given audio sample.

## Limitations

Since this model was trained only on the SLURP dataset [1], its performance might degrade on other datasets.

## References

[1] [SLURP: A Spoken Language Understanding Resource Package](https://arxiv.org/abs/2011.13205)

[2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## License

License to use this model is covered by the NGC [TERMS OF USE](https://ngc.nvidia.com/legal/terms) unless another License/Terms of Use/EULA is clearly specified. By downloading the public and release version of the model, you accept the terms and conditions of the NGC [TERMS OF USE](https://ngc.nvidia.com/legal/terms).