---
language:
- en
library_name: nemo
datasets:
- SLURP
thumbnail: null
tags:
- spoken-language-understanding
- speech-intent-classification
- speech-slot-filling
- SLURP
- Conformer
- Transformer
- pytorch
- NeMo
license: cc-by-4.0
model-index:
- name: slu_conformer_transformer_large_slurp
  results:
  - task:
      name: Slot Filling
      type: slot-filling
    dataset:
      name: SLURP
      type: slurp
      split: test
    metrics:
    - name: F1
      type: f1
      value: 82.27
  - task:
      name: Intent Classification
      type: intent-classification
    dataset:
      name: SLURP
      type: slurp
      split: test
    metrics:
    - name: Accuracy
      type: acc
      value: 90.14
---

# NeMo End-to-End Speech Intent Classification and Slot Filling

## Model Overview

This model performs joint intent classification and slot filling directly from audio input. It treats the problem as an audio-to-text task, where the output text is a flattened string representation of the semantics annotation. The model is trained on the SLURP dataset [1].
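
For illustration, a flattened semantics target for an utterance such as "wake me up at eight am" could look like the following. The serialization shown is a hypothetical sketch of a SLURP-style annotation, not necessarily the exact format the model emits:

```
{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'eight am'}]}
```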

## Model Architecture

The model has an encoder-decoder architecture: the encoder is a Conformer-Large model [2], and the decoder is a three-layer Transformer decoder [3]. We use a Conformer encoder pretrained on NeMo ASR-Set (details [here](https://ngc.nvidia.com/models/nvidia:nemo:stt_en_conformer_ctc_large)), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
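
To make the BOS/EOS mechanics concrete, here is a minimal greedy-decoding sketch for illustration only (the actual model uses beam search); `decoder` is a placeholder callable that maps the tokens generated so far, plus the encoder states, to vocabulary logits:

```python
# Minimal sketch, not the NeMo implementation: autoregressive decoding
# triggered by a BOS token and terminated by an EOS token.
import torch

def greedy_decode(decoder, encoder_states, bos_id: int, eos_id: int, max_len: int = 64):
    """`decoder(tokens, encoder_states)` -> logits of shape [1, seq_len, vocab]."""
    tokens = torch.tensor([[bos_id]])  # generation starts from BOS
    for _ in range(max_len):
        logits = decoder(tokens, encoder_states)
        next_id = logits[0, -1].argmax().item()  # most likely next token
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:  # stop at end-of-sentence
            break
    return tokens[0, 1:]  # drop the BOS trigger
```

Beam search generalizes this loop by keeping the `beam_size` highest-scoring prefixes at each step instead of a single one.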

## Training

The NeMo toolkit [4] was used to train the model for around 100 epochs, with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/run_slurp_train.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/configs/conformer_transformer_large_bpe.yaml).
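
A hypothetical launch command is sketched below; the manifest paths and trainer overrides are placeholders, and the script follows the usual NeMo Hydra conventions, so the base config's defaults may already cover most of them:

```shell
# Hypothetical invocation; all paths and overrides are placeholders.
python [NEMO_GIT_FOLDER]/examples/slu/slurp/run_slurp_train.py \
    --config-path=configs --config-name=conformer_transformer_large_bpe \
    model.train_ds.manifest_filepath="<TRAIN MANIFEST>" \
    model.validation_ds.manifest_filepath="<DEV MANIFEST>" \
    trainer.max_epochs=100
```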

The tokenizer for this model was built using the semantics annotations of the training set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). We use a vocabulary size of 58, including the BOS, EOS, and padding tokens.
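
For reference, a hypothetical tokenizer-building command is shown below; the manifest and output paths are placeholders:

```shell
# Hypothetical invocation; paths are placeholders.
python [NEMO_GIT_FOLDER]/scripts/tokenizers/process_asr_text_tokenizer.py \
    --manifest="<TRAIN MANIFEST WITH SEMANTICS ANNOTATIONS>" \
    --data_root="<TOKENIZER OUTPUT DIR>" \
    --vocab_size=58 \
    --tokenizer=spe
```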

Details on how to train the model can be found [here](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/speech_intent_slot/README.md).

### Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

## Performance

| | | | | **Intent (Scenario_Action)** | **Entity** | | | **SLURP Metrics** | | |
|-------------|-----------------------------|----------------|------------------|------------------------------|---------------|------------|--------|-------------------|------------|--------|
| **Version** | **Model** | **Params (M)** | **Pretrained** | **Accuracy** | **Precision** | **Recall** | **F1** | **Precision** | **Recall** | **F1** |
| 1.13.0 | Conformer-Transformer-Large | 127 | NeMo ASR-Set 3.0 | 90.14 | 78.95 | 74.93 | 76.89 | 84.31 | 80.33 | 82.27 |
| Baseline | Conformer-Transformer-Large | 127 | None | 72.56 | 43.19 | 43.50 | 43.34 | 53.59 | 53.92 | 53.76 |

Note: during inference, we use a beam size of 32 and a temperature of 1.25.

## How to Use this Model

The model is available for use in the NeMo toolkit [4], and can be used on another dataset with the same annotation format.

### Automatically load the model from NGC

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
```
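
Once loaded, predictions can be obtained directly from audio files. This is a minimal sketch assuming the model exposes the standard NeMo `transcribe()` API; the audio path is a placeholder:

```python
# Hypothetical usage; the audio path is a placeholder.
predictions = asr_model.transcribe(["/path/to/audio.wav"])
print(predictions[0])  # flattened intent/slot semantics string
```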

### Predict intents and slots with this model

```shell
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
    pretrained_name="slu_conformer_transformer_large_slurp" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
    sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
    sequence_generator.beam_size="<SIZE OF BEAM>" \
    sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
```

### Input

This model accepts 16,000 Hz mono-channel audio (WAV files) as input.
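
Audio in other formats or at other sample rates can be converted first, for example with ffmpeg:

```shell
# Convert arbitrary input audio to 16 kHz mono WAV (file names are placeholders).
ffmpeg -i input.mp3 -ac 1 -ar 16000 output.wav
```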

### Output

This model provides the intent and slot annotations as a string for a given audio sample.

## Limitations

Since this model was trained only on the SLURP dataset [1], its performance may degrade on other datasets.

## References

[1] [SLURP: A Spoken Language Understanding Resource Package](https://arxiv.org/abs/2011.13205)

[2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)