NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model
Features
- 🤖 An end-to-end model without intermediate text decoding
- 💪 Supports offline and streaming decoding of all modalities
- ⚡️ 28× faster inference compared to autoregressive models
Examples
We present an example of French-to-English translation with chunk sizes of 320 ms and 2560 ms, as well as under offline conditions.
- With chunk sizes of 320 ms and 2560 ms, the model starts generating the English translation before the source speech is complete.
- In the examples of simultaneous interpretation, the left audio channel is the input streaming speech, and the right audio channel is the simultaneous translation.
For a better experience, please wear headphones.
| Chunk Size 320ms | Chunk Size 2560ms | Offline |
| --- | --- | --- |

| Source Speech Transcript | Reference Text Translation |
| --- | --- |
| Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné. | before the fusion of the towns rouge thier was a part of the town of louveigne |
For more examples, please check https://nast-s2x.github.io/.
Performance
- ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
- 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation with a delay of less than 3 seconds
- 🤖 Unified Framework: Supports end-to-end text & speech generation in one model
Check Details 👇
Architecture
- Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
- Minimum Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size.
- End-to-End: Generate target speech without target text decoding.
Sources and Usage
Model
We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce the results in our paper. You can train models for your desired languages by following the instructions below.
| Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging |
| --- | --- | --- | --- | --- |
| 320ms | checkpoint | 19.67 | 24.90 | -393ms |
| 1280ms | checkpoint | 20.20 | 25.71 | 3330ms |
| 2560ms | checkpoint | 24.88 | 26.14 | 4976ms |
| Offline | checkpoint | 25.82 | - | - |

| Vocoder |
| --- |
Inference
Before executing the provided shell scripts, please make sure to replace the variables in each file with the paths specific to your machine.
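For illustration, the variables to edit at the top of each script might look like the following; the variable names here are hypothetical placeholders, not necessarily the names used in the released scripts:

```bash
# Hypothetical placeholders -- replace with paths on your machine.
DATA_ROOT=/path/to/preprocessed/data        # output of the data preprocessing step
CKPT_PATH=/path/to/nast_s2x_checkpoint.pt   # downloaded NAST-S2X model checkpoint
VOCODER_PATH=/path/to/vocoder_checkpoint    # vocoder used to synthesize waveforms
OUTPUT_DIR=/path/to/output/directory        # where generated units and waveforms are written
```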
Offline Inference
- Data preprocessing: Follow the instructions in the document.
- Generate Acoustic Unit: Execute `offline_s2u_infer.sh`
- Generate Waveform: Execute `offline_wav_infer.sh`
- Evaluation: Use Fairseq's ASR-BLEU evaluation toolkit (a sketch of the full offline pipeline follows this list)
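A minimal sketch of the offline pipeline, assuming the variables inside the scripts have already been set and the scripts are run from the repository root:

```bash
# 1. Generate discrete acoustic units from the source speech
bash offline_s2u_infer.sh

# 2. Convert the generated units to waveforms with the vocoder
bash offline_wav_infer.sh

# 3. Score the synthesized speech with Fairseq's ASR-BLEU toolkit
#    (see the Fairseq documentation for the exact evaluation command)
```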
Simultaneous Inference
- We use our customized fork of SimulEval (commit `b43a7c`) to evaluate the model in simultaneous inference. This repository is built upon the official SimulEval (commit `a1435b`) and includes additional latency scorers.
- Data preprocessing: Follow the instructions in the document.
- Streaming Generation and Evaluation: Execute `streaming_infer.sh` (see the sketch after this list)
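A minimal sketch of the simultaneous setup, assuming the customized SimulEval fork is cloned locally; the clone URL below is a placeholder for the fork linked above:

```bash
# Install the customized SimulEval fork (placeholder URL; use the fork linked above)
git clone https://github.com/<your-fork>/SimulEval.git
cd SimulEval
git checkout b43a7c
pip install -e .
cd ..

# Run streaming generation and report quality/latency metrics
bash streaming_infer.sh
```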
Train your own NAST-S2X
- Data preprocessing: Follow the instructions in the document.
- CTC Pretraining: Execute `train_ctc.sh`
- NMLA Training: Execute `train_nmla.sh` (see the sketch after this list)
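The two training stages might be chained as below, assuming the data and checkpoint paths inside the scripts have already been set:

```bash
# Stage 1: CTC pretraining
bash train_ctc.sh

# Stage 2: NMLA (non-monotonic latent alignment) training,
# initialized from the CTC-pretrained checkpoint
bash train_nmla.sh
```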
Citing
Please cite us if you find our papers or code useful.
@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024}
}

@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024}
}