---
language:
- en
library_name: nemo
datasets:
- SLURP
thumbnail: null
tags:
- spoken-language-understanding
- speech-intent-classification
- speech-slot-filling
- SLURP
- Conformer
- Transformer
- pytorch
- NeMo
license: cc-by-4.0
model-index:
- name: slu_conformer_transformer_large_slurp
  results:
  - task:
      name: Slot Filling
      type: slot-filling
    dataset:
      name: SLURP
      type: slurp
      split: test
    metrics:
    - name: F1
      type: f1
      value: 82.27
  - task:
      name: Intent Classification
      type: intent-classification
    dataset:
      name: SLURP
      type: slurp
      split: test
    metrics:
    - name: Accuracy
      type: acc
      value: 90.14
---

# NeMo End-to-End Speech Intent Classification and Slot Filling

## Model Overview

This model performs joint intent classification and slot filling directly from audio input. It treats the problem as an audio-to-text task, where the output text is a flattened string representation of the semantics annotation. The model is trained on the SLURP dataset [1].
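
For illustration, a flattened semantics target for an utterance such as "wake me up at eight am" could look like the following. The serialization shown is a hypothetical sketch of a SLURP-style annotation, not necessarily the exact format the model emits:

```
{'scenario': 'alarm', 'action': 'set', 'entities': [{'type': 'time', 'filler': 'eight am'}]}
```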

## Model Architecture

The model has an encoder-decoder architecture: the encoder is a Conformer-Large model [2], and the decoder is a three-layer Transformer decoder [3]. We use a Conformer encoder pretrained on NeMo ASR-Set (details [here](https://ngc.nvidia.com/models/nvidia:nemo:stt_en_conformer_ctc_large)), while the decoder is trained from scratch. Start-of-sentence (BOS) and end-of-sentence (EOS) tokens are added to each sentence. The model is trained end-to-end by minimizing the negative log-likelihood loss with teacher forcing. During inference, the prediction is generated by beam search, where a BOS token is used to trigger the generation process.
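
To make the BOS/EOS mechanics concrete, here is a minimal greedy-decoding sketch for illustration only (the actual model uses beam search); `decoder` is a placeholder callable that maps the tokens generated so far, plus the encoder states, to vocabulary logits:

```python
# Minimal sketch, not the NeMo implementation: autoregressive decoding
# triggered by a BOS token and terminated by an EOS token.
import torch

def greedy_decode(decoder, encoder_states, bos_id: int, eos_id: int, max_len: int = 64):
    """`decoder(tokens, encoder_states)` -> logits of shape [1, seq_len, vocab]."""
    tokens = torch.tensor([[bos_id]])  # generation starts from BOS
    for _ in range(max_len):
        logits = decoder(tokens, encoder_states)
        next_id = logits[0, -1].argmax().item()  # most likely next token
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:  # stop at end-of-sentence
            break
    return tokens[0, 1:]  # drop the BOS trigger
```

Beam search generalizes this loop by keeping the `beam_size` highest-scoring prefixes at each step instead of a single one.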

## Training

The NeMo toolkit [4] was used to train the model for around 100 epochs, with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/run_slurp_train.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/slurp/configs/conformer_transformer_large_bpe.yaml).
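
A hypothetical launch command is sketched below; the manifest paths and trainer overrides are placeholders, and the script follows the usual NeMo Hydra conventions, so the base config's defaults may already cover most of them:

```shell
# Hypothetical invocation; all paths and overrides are placeholders.
python [NEMO_GIT_FOLDER]/examples/slu/slurp/run_slurp_train.py \
    --config-path=configs --config-name=conformer_transformer_large_bpe \
    model.train_ds.manifest_filepath="<TRAIN MANIFEST>" \
    model.validation_ds.manifest_filepath="<DEV MANIFEST>" \
    trainer.max_epochs=100
```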

The tokenizer for this model was built using the semantics annotations of the training set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). We use a vocabulary size of 58, including the BOS, EOS, and padding tokens.
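
For reference, a hypothetical tokenizer-building command is shown below; the manifest and output paths are placeholders:

```shell
# Hypothetical invocation; paths are placeholders.
python [NEMO_GIT_FOLDER]/scripts/tokenizers/process_asr_text_tokenizer.py \
    --manifest="<TRAIN MANIFEST WITH SEMANTICS ANNOTATIONS>" \
    --data_root="<TOKENIZER OUTPUT DIR>" \
    --vocab_size=58 \
    --tokenizer=spe
```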

Details on how to train the model can be found [here](https://github.com/NVIDIA/NeMo/blob/main/examples/slu/speech_intent_slot/README.md).

### Datasets

The model is trained on the combined real and synthetic training sets of the SLURP dataset.

## Performance

| | | | | **Intent (Scenario_Action)** | **Entity** | | | **SLURP Metrics** | | |
|-------------|-----------------------------|----------------|------------------|------------------------------|---------------|------------|--------|-------------------|------------|--------|
| **Version** | **Model** | **Params (M)** | **Pretrained** | **Accuracy** | **Precision** | **Recall** | **F1** | **Precision** | **Recall** | **F1** |
| 1.13.0 | Conformer-Transformer-Large | 127 | NeMo ASR-Set 3.0 | 90.14 | 78.95 | 74.93 | 76.89 | 84.31 | 80.33 | 82.27 |
| Baseline | Conformer-Transformer-Large | 127 | None | 72.56 | 43.19 | 43.50 | 43.34 | 53.59 | 53.92 | 53.76 |

Note: during inference, we use a beam size of 32 and a temperature of 1.25.

## How to Use this Model

The model is available for use in the NeMo toolkit [4], and can be used on another dataset with the same annotation format.

### Automatically load the model from NGC

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.SLUIntentSlotBPEModel.from_pretrained(model_name="slu_conformer_transformer_large_slurp")
```
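
Once loaded, predictions can be obtained directly from audio files. This is a minimal sketch assuming the model exposes the standard NeMo `transcribe()` API; the audio path is a placeholder:

```python
# Hypothetical usage; the audio path is a placeholder.
predictions = asr_model.transcribe(["/path/to/audio.wav"])
print(predictions[0])  # flattened intent/slot semantics string
```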

### Predict intents and slots with this model

```shell
python [NEMO_GIT_FOLDER]/examples/slu/speech_intent_slot/eval_utils/inference.py \
    pretrained_name="slu_conformer_transformer_large_slurp" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
    sequence_generator.type="<'beam' OR 'greedy' FOR BEAM/GREEDY SEARCH>" \
    sequence_generator.beam_size="<SIZE OF BEAM>" \
    sequence_generator.temperature="<TEMPERATURE FOR BEAM SEARCH>"
```

### Input

This model accepts 16,000 Hz mono-channel audio (WAV files) as input.
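
Audio in other formats or at other sample rates can be converted first, for example with ffmpeg:

```shell
# Convert arbitrary input audio to 16 kHz mono WAV (file names are placeholders).
ffmpeg -i input.mp3 -ac 1 -ar 16000 output.wav
```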

### Output

This model provides the intent and slot annotations as a string for a given audio sample.

## Limitations

Since this model was trained only on the SLURP dataset [1], its performance may degrade on other datasets.

## References

[1] [SLURP: A Spoken Language Understanding Resource Package](https://arxiv.org/abs/2011.13205)

[2] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[4] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)