NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model
Features
- 🤖 An end-to-end model without intermediate text decoding
- 💪 Supports offline and streaming decoding of all modalities
- ⚡️ 28× faster inference compared to autoregressive models
Examples
We present an example of French-to-English translation with chunk sizes of 320 ms and 2560 ms, as well as under offline conditions.
- With chunk sizes of 320 ms and 2560 ms, the model starts generating the English translation before the source speech is complete.
- In the examples of simultaneous interpretation, the left audio channel is the input streaming speech, and the right audio channel is the simultaneous translation.
For a better experience, please wear headphones.
| Chunk Size 320ms | Chunk Size 2560ms | Offline |
| --- | --- | --- |

| Source Speech Transcript | Reference Text Translation |
| --- | --- |
| Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné. | before the fusion of the towns rouge thier was a part of the town of louveigne |
For more examples, please check https://nast-s2x.github.io/.
Performance
- ⚡️ Lightning Fast: 28× faster inference and competitive quality in offline speech-to-speech translation
- 👩‍💼 Simultaneous: Achieves high-quality simultaneous interpretation with a delay of less than 3 seconds
- 🤖 Unified Framework: Supports end-to-end text & speech generation in one model
Check Details 👇
Architecture
- Fully Non-autoregressive: Trained with CTC-based non-monotonic latent alignment loss (Shao and Feng, 2022) and glancing mechanism (Qian et al., 2021).
- Minimum Human Design: Seamlessly switch between offline translation and simultaneous interpretation by adjusting the chunk size.
- End-to-End: Generate target speech without target text decoding.
Sources and Usage
Model
We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce the results in our paper. You can train models for your desired languages by following the instructions below.
| Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging |
| --- | --- | --- | --- | --- |
| 320ms | checkpoint | 19.67 | 24.90 | -393ms |
| 1280ms | checkpoint | 20.20 | 25.71 | 3330ms |
| 2560ms | checkpoint | 24.88 | 26.14 | 4976ms |
| Offline | checkpoint | 25.82 | - | - |

| Vocoder |
| --- |
Inference
Before executing the provided shell scripts, please make sure to replace the variables in each file with the paths specific to your machine.
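For illustration, the variables to edit at the top of each script might look like the following; the variable names here are hypothetical placeholders, not necessarily the names used in the released scripts:

```bash
# Hypothetical placeholders -- replace with paths on your machine.
DATA_ROOT=/path/to/preprocessed/data        # output of the data preprocessing step
CKPT_PATH=/path/to/nast_s2x_checkpoint.pt   # downloaded NAST-S2X model checkpoint
VOCODER_PATH=/path/to/vocoder_checkpoint    # vocoder used to synthesize waveforms
OUTPUT_DIR=/path/to/output/directory        # where generated units and waveforms are written
```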
Offline Inference
- Data preprocessing: Follow the instructions in the document.
- Generate Acoustic Unit: Execute `offline_s2u_infer.sh`
- Generate Waveform: Execute `offline_wav_infer.sh`
- Evaluation: Use Fairseq's ASR-BLEU evaluation toolkit (a sketch of the full offline pipeline follows this list)
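A minimal sketch of the offline pipeline, assuming the variables inside the scripts have already been set and the scripts are run from the repository root:

```bash
# 1. Generate discrete acoustic units from the source speech
bash offline_s2u_infer.sh

# 2. Convert the generated units to waveforms with the vocoder
bash offline_wav_infer.sh

# 3. Score the synthesized speech with Fairseq's ASR-BLEU toolkit
#    (see the Fairseq documentation for the exact evaluation command)
```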
Simultaneous Inference
- We use our customized fork of SimulEval (commit `b43a7c`) to evaluate the model in simultaneous inference. This repository is built upon the official SimulEval (commit `a1435b`) and includes additional latency scorers.
- Data preprocessing: Follow the instructions in the document.
- Streaming Generation and Evaluation: Execute `streaming_infer.sh` (see the sketch after this list)
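A minimal sketch of the simultaneous setup, assuming the customized SimulEval fork is cloned locally; the clone URL below is a placeholder for the fork linked above:

```bash
# Install the customized SimulEval fork (placeholder URL; use the fork linked above)
git clone https://github.com/<your-fork>/SimulEval.git
cd SimulEval
git checkout b43a7c
pip install -e .
cd ..

# Run streaming generation and report quality/latency metrics
bash streaming_infer.sh
```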
Train your own NAST-S2X
- Data preprocessing: Follow the instructions in the document.
- CTC Pretraining: Execute `train_ctc.sh`
- NMLA Training: Execute `train_nmla.sh` (see the sketch after this list)
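The two training stages might be chained as below, assuming the data and checkpoint paths inside the scripts have already been set:

```bash
# Stage 1: CTC pretraining
bash train_ctc.sh

# Stage 2: NMLA (non-monotonic latent alignment) training,
# initialized from the CTC-pretrained checkpoint
bash train_nmla.sh
```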
Citing
Please cite us if you find our papers or code useful.
@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024}
}

@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024}
}