license: mit
language:
- zh
- en
metrics:
- cer
- bleu
tags:
- asr
- automatic-speech-recognition
- automatic-speech-translation
- speech-translation
- speech-recognition
MooER (ζ©θ³): an LLM-based Speech Recognition and Translation Model from Moore Threads
Online Demo: https://mooer-speech.mthreads.com:10077/
π₯ Update
We release a new model MooER-80K-v2 using 80K hours of data. Currently, MooER-80K-v2 supports the ASR task. The AST and multi-task models will be released soon.
π Introduction
We introduce MooER (ζ©θ³): an LLM-based speech recognition and translation model developed by Moore Threads. With the MooER framework, you can transcribe the speech into text (speech recognition or, ASR), and translate it into other languages (speech translation or, AST) in a end-to-end manner. The performance of MooER is demonstrated in the subsequent section, along with our insights into model configurations, training strategies, and more, provided in our technical report.
For the usage of the model files, please refer to our GitHub
π₯ Evaluation Results
We demonstrate the training data and the evaluation results below. For more comprehensive information, please refer to our report.
Training data
We utilize 5k hours of data (MT5K) to train our basic MooER-5K model. The data sources include:
Dataset | Duration |
---|---|
aishell2 | 137h |
librispeech | 131h |
multi_cn | 100h |
wenetspeech | 1361h |
in-house data | 3274h |
Note that, data from the open-source datasets were randomly selected from the full training set. The in-house data, collected internally without text, were transcribed using a third-party ASR service.
Since all the above datasets were originally designed only for the speech recognition task, no translation results are available. To train our speech translation model, we used a third-party translation service to generate pseudo-labels. No data filtering techniques were applied.
At this moment, we are also developing a new model trained with 80K hours of data.
Speech Recognition
The performance of speech recognition is evaluated using WER/CER.
Language | Testset | Paraformer-large | SenseVoice-small | Qwen-audio | Whisper-large-v3 | SeamlessM4T-v2 | MooER-5K | MooER-80K | MooER-80K-v2 |
---|---|---|---|---|---|---|---|---|---|
Chinese | aishell1 | 1.93 | 3.03 | 1.43 | 7.86 | 4.09 | 1.93 | 1.25 | 1.00 |
aishell2_ios | 2.85 | 3.79 | 3.57 | 5.38 | 4.81 | 3.17 | 2.67 | 2.62 | |
test_magicdata | 3.66 | 3.81 | 5.31 | 8.36 | 9.69 | 3.48 | 2.52 | 2.17 | |
test_thchs | 3.99 | 5.17 | 4.86 | 9.06 | 7.14 | 4.11 | 3.14 | 3.00 | |
fleurs cmn_dev | 5.56 | 6.39 | 10.54 | 4.54 | 7.12 | 5.81 | 5.23 | 5.15 | |
fleurs cmn_test | 6.92 | 7.36 | 11.07 | 5.24 | 7.66 | 6.77 | 6.18 | 6.14 | |
average | 4.15 | 4.93 | 6.13 | 6.74 | 6.75 | 4.21 | 3.50 | 3.35 | |
English | librispeech test_clean | 14.15 | 4.07 | 2.15 | 3.42 | 2.77 | 7.78 | 4.11 | 3.57 |
librispeech test_other | 22.99 | 8.26 | 4.68 | 5.62 | 5.25 | 15.25 | 9.99 | 9.09 | |
fleurs eng_dev | 24.93 | 12.92 | 22.53 | 11.63 | 11.36 | 18.89 | 13.32 | 13.12 | |
fleurs eng_test | 26.81 | 13.41 | 22.51 | 12.57 | 11.82 | 20.41 | 14.97 | 14.74 | |
gigaspeech dev | 24.23 | 19.44 | 12.96 | 19.18 | 28.01 | 23.46 | 16.92 | 17.34 | |
gigaspeech test | 23.07 | 16.65 | 13.26 | 22.34 | 28.65 | 22.09 | 16.64 | 16.97 | |
average | 22.70 | 12.46 | 13.02 | 12.46 | 14.64 | 17.98 | 12.66 | 12.47 |
Speech Translation (zh -> en)
For speech translation, the performanced is evaluated using BLEU score.
Testset | Speech-LLaMA | Whisper-large-v3 | Qwen-audio | Qwen2-audio | SeamlessM4T-v2 | MooER-5K | MooER-5K-MTL |
---|---|---|---|---|---|---|---|
CoVoST1 zh2en | - | 13.5 | 13.5 | - | 25.3 | - | 30.2 |
CoVoST2 zh2en | 12.3 | 12.2 | 15.7 | 24.4 | 22.2 | 23.4 | 25.2 |
CCMT2019 dev | - | 15.9 | 12.0 | - | 14.8 | - | 19.6 |
π Getting Started
Please visit our GitHub for the setup and usage.
π§Ύ License
Please see the LICENSE.
π Citation
If you find MooER useful for your research, please π this repo and cite our work using the following BibTeX:
@article{liang2024mooer,
title = {MooER: an LLM-based Speech Recognition and Translation Model from Moore Threads},
author = {Zhenlin Liang, Junhao Xu, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang},
journal = {arXiv preprint arXiv:2408.05101},
url = {https://arxiv.org/abs/2408.05101},
year = {2024}
}
π§ Contact
If you encouter any problems, feel free to create a discussion.
Moore Threads Website: https://www.mthreads.com/