arXiv:2410.15017

DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Published on Oct 19 · Submitted by amanchadha on Oct 22

Abstract

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
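
For intuition, here is a minimal PyTorch sketch of the combined LM- and SM-guided distillation idea described in the abstract: the codec's RVQ-quantized frame latents are projected into the teacher feature spaces and pulled toward frozen LM (contextual) and SM (semantic) features with a cosine-distance loss. The projection layers, dimensions, feature alignment, and loss choice here are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of LM + SM guided distillation for a neural codec.
# Assumes teacher features are precomputed, frozen, and already resampled
# to the codec frame rate; all module names and dims are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationHeads(nn.Module):
    """Projects codec latents into the teacher feature spaces for distillation."""
    def __init__(self, codec_dim: int, lm_dim: int, sm_dim: int):
        super().__init__()
        self.to_lm = nn.Linear(codec_dim, lm_dim)  # codec -> LM (contextual) space
        self.to_sm = nn.Linear(codec_dim, sm_dim)  # codec -> SM (semantic) space

    def forward(self, codec_latents, lm_feats, sm_feats):
        # codec_latents: (B, T, codec_dim); teacher features: (B, T, lm_dim / sm_dim)
        lm_loss = (1 - F.cosine_similarity(self.to_lm(codec_latents), lm_feats, dim=-1)).mean()
        sm_loss = (1 - F.cosine_similarity(self.to_sm(codec_latents), sm_feats, dim=-1)).mean()
        return lm_loss, sm_loss

# Example usage with random tensors standing in for real features (teachers frozen).
heads = DistillationHeads(codec_dim=512, lm_dim=768, sm_dim=768)
codec_latents = torch.randn(2, 100, 512)       # RVQ-quantized frame features from the codec encoder
lm_feats = torch.randn(2, 100, 768).detach()   # e.g., text-LM hidden states over the transcript
sm_feats = torch.randn(2, 100, 768).detach()   # e.g., speech-SSL hidden states over the waveform
lm_loss, sm_loss = heads(codec_latents, lm_feats, sm_feats)
# total_loss = reconstruction_loss + adversarial_loss + lambda_lm * lm_loss + lambda_sm * sm_loss
```

The distillation terms would simply be added to the usual codec training objectives (reconstruction and adversarial losses), which is what lets the tokenizer absorb contextual and semantic information without changing the encoder-decoder-RVQ architecture at inference time.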

Community

Paper author · Paper submitter

The paper introduces DM-Codec, a novel speech tokenizer that integrates multimodal (acoustic, semantic, and contextual) representations via language model (LM) and self-supervised speech model (SM) distillation, achieving significant improvements in speech tokenization and transcription accuracy.

  • Novel Approach: Proposes two distillation methods (LM-guided and combined LM-SM-guided) to incorporate multimodal speech representations for improved speech tokenization.
  • Performance Gains: DM-Codec outperforms state-of-the-art models, reducing Word Error Rate (WER) by up to 13.46% and improving speech quality and intelligibility on the LibriSpeech dataset.
  • Comprehensive Evaluation: Conducts extensive experiments and ablation studies, demonstrating the effectiveness of multimodal representation distillation for robust speech reconstruction.
