CHiME8 DASR NeMo Baseline Models

The model files in this repository are the models used in the paper "The CHiME-7 Challenge: System Description and Performance of NeMo Team’s DASR System".
These models are required to run the CHiME8-DASR NeMo baseline (CHiME8-DASR-Baseline NeMo).
The VAD, diarization, and ASR models are all based on the NVIDIA NeMo conversational AI toolkit.

1. Voice Activity Detection (VAD) Model:

This model is based on the NeMo MarbleNet VAD model.
For validation, we use a dataset comprising the CHiME-6 development subset and 50 hours of simulated audio data.
The simulated data is generated using the NeMo multi-speaker data simulator on the VoxCeleb1&2 datasets.
The multi-speaker data simulation yields a total of 2,000 hours of audio, of which approximately 30% is silence.
Model training incorporates SpecAugment and noise augmentation using the MUSAN noise dataset.
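
As an illustration of the SpecAugment-style masking mentioned above, the following is a minimal sketch in plain Python. It is not the NeMo implementation: the function name `spec_augment`, the mask widths, and the use of a single frequency mask and a single time mask are all assumptions made for the example.

```python
import random

def spec_augment(spec, max_freq_width=2, max_time_width=3, rng=None):
    """Return a masked copy of `spec`, a list of frequency rows of time frames.

    One band of frequency bins and one band of time frames are set to 0.0.
    Widths are hypothetical defaults, not the values used for this model.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: zero out `f_width` consecutive frequency bins.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time mask: zero out `t_width` consecutive frames in every bin.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out

# Toy "spectrogram": 8 frequency bins x 10 frames, all ones.
spec = [[1.0] * 10 for _ in range(8)]
augmented = spec_augment(spec, rng=random.Random(0))
```

Masking is applied on the fly during training, so each pass over the data sees differently masked features.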

2. Speaker Diarization Model:

Our DASR system is based on a speaker diarization system that uses the multi-scale diarization decoder (MSDD).

- MSDD Reference: Park et al. (2022)
The MSDD-v2 speaker diarization system employs a multi-scale embedding approach and uses the TitaNet speaker embedding extractor.
- TitaNet Reference: Koluguri et al. (2022)
- The TitaNet model is included in the MSDD-v2 .nemo checkpoint file.
Unlike the earlier MSDD system, which uses a multi-layer LSTM architecture, we employ a four-layer Transformer architecture with a hidden size of 384.
This neural model generates logit values indicating speaker existence.
Our diarization model is trained on approximately 3,000 hours of simulated audio mixture data from the same multi-speaker data simulator used in VAD model training, drawing from VoxCeleb1&2 and LibriSpeech datasets.
- LibriSpeech Reference: OpenSLR Download; LibriSpeech, Panayotov et al. (2015)
MUSAN is also used to add background noise, focusing on music and broadband noise.
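
To make the "logit values indicating speaker existence" concrete, here is a minimal sketch of how per-frame, per-speaker logits could be turned into binary activity decisions. The sigmoid-plus-threshold rule, the threshold of 0.5, the four-speaker layout, and the function names are assumptions for illustration; the actual post-processing in the baseline may differ.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def speaker_activity(logits, threshold=0.5):
    """Map logits[t][s] (frame t, speaker s) to 0/1 activity decisions.

    Each speaker is decided independently, so overlapping speech
    (several speakers active in one frame) is representable.
    """
    return [[1 if sigmoid(z) >= threshold else 0 for z in frame]
            for frame in logits]

# Toy logits for 3 frames and 4 hypothetical speaker slots.
logits = [
    [2.0, -3.0, -1.5, -4.0],   # only speaker 0 active
    [1.2, 0.8, -2.0, -3.5],    # speakers 0 and 1 overlap
    [-2.5, 1.9, -0.3, -4.0],   # only speaker 1 active
]
activity = speaker_activity(logits)
# activity == [[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 0, 0]]
```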

3. Automatic Speech Recognition (ASR) Model:

This ASR model is based on the NeMo FastConformer XL model.
Single-channel audio generated using a multi-channel front-end (Guided Source Separation, GSS) is transcribed using a 0.6B parameter Conformer-based transducer (RNNT) model.
- Model Reference: Gulati et al. (2020)
The model was initialized using a publicly available NeMo checkpoint.
- NeMo Checkpoint: NGC Model Card: Conformer Transducer XL
This model was then fine-tuned on the CHiME-7 train and dev sets, which include the CHiME-6 and Mixer6 training subsets, after processing the data through the multi-channel ASR front-end using ground-truth diarization.
- Fine-Tuning Details:
  - Fine-tuning Duration: 35,000 updates
  - Batch Size: 128

4. Language Model for ASR Decoding:

This KenLM model is trained solely on the CHiME7-DASR datasets (Mixer6, CHiME6, DipCo).
We apply a word-piece level N-gram language model using byte-pair-encoding (BPE) tokens.
The model is built with the SentencePiece and KenLM toolkits from the transcriptions of the CHiME-7 train and dev sets.
- SentencePiece: Kudo and Richardson (2018)
- KenLM: KenLM GitRepo
The token sets of our ASR and LM models were matched to ensure consistency.
To combine several N-gram models with equal weights, we used the OpenGrm library.
- OpenGrm: Roark et al. (2012)
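
The idea of combining several N-gram models with equal weights can be sketched as plain linear interpolation of their probabilities. This is a toy illustration only, not the OpenGrm API, and the merging method actually used may differ; the table layout, the function name `interpolate`, and all probabilities below are made up for the example.

```python
def interpolate(models, context, token, weights=None):
    """Linearly interpolate P(token | context) across several N-gram models.

    Each model is a dict: context tuple -> {token: probability}.
    With weights=None, every model gets the same weight 1 / len(models).
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return sum(w * m.get(context, {}).get(token, 0.0)
               for w, m in zip(weights, models))

# Two toy bigram tables over the same (hypothetical) token set.
lm_a = {("the",): {"meet": 0.6, "mic": 0.4}}
lm_b = {("the",): {"meet": 0.2, "mic": 0.8}}

p_meet = interpolate([lm_a, lm_b], ("the",), "meet")  # 0.5*0.6 + 0.5*0.2
p_mic = interpolate([lm_a, lm_b], ("the",), "mic")    # 0.5*0.4 + 0.5*0.8
```

Because the weights sum to one and each input is a proper distribution, the interpolated values for a given context also sum to one.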
Modified adaptive expansion search (MAES) decoding was employed for the transducer, which accelerates the decoding process.
- MAES Decoding: Kim et al. (2020)
As expected, integrating the language model into the beam-search decoder significantly improves the performance of the end-to-end model compared with decoding without a language model.
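
The effect of adding a language model score can be illustrated with a simplified sketch. In the actual system the N-gram model is applied at the token level inside the beam search; the example below instead reranks a finished list of hypotheses, which is an assumption made to keep it short. The weight `lm_weight`, the function names, and all scores are hypothetical.

```python
def fused_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Combine ASR and LM log-probabilities with a tunable LM weight."""
    return asr_logprob + lm_weight * lm_logprob

def rerank(hypotheses, lm_weight=0.3):
    """Sort (text, asr_logprob, lm_logprob) tuples, best fused score first."""
    return sorted(hypotheses,
                  key=lambda h: fused_score(h[1], h[2], lm_weight),
                  reverse=True)

# Toy hypotheses with made-up log-probabilities.
hyps = [
    ("i scream for ice cream", -4.0, -9.0),
    ("ice cream for ice cream", -4.2, -15.0),
    ("i scream four ice cream", -3.9, -14.0),
]

best_without_lm = rerank(hyps, lm_weight=0.0)[0][0]
best_with_lm = rerank(hyps, lm_weight=0.3)[0][0]
# best_without_lm == "i scream four ice cream"
# best_with_lm == "i scream for ice cream"
```

With the weight set to zero the acoustically best hypothesis wins; with a nonzero weight the hypothesis that the language model finds more plausible overtakes it.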