---
license: mit
language:
- zh
- en
metrics:
- cer
- bleu
tags:
- asr
- automatic-speech-recognition
- automatic-speech-translation
- speech-translation
- speech-recognition
---

# MooER (摩耳): an LLM-based Speech Recognition and Translation Model from Moore Threads

**Online Demo**: https://mooer-speech.mthreads.com:10077/

## 🔥 Update

We have released a new model, *MooER-80K-v2*, trained on 80K hours of data. Currently, *MooER-80K-v2* supports only the ASR task; the AST and multi-task models will be released soon.

## 📖 Introduction

We introduce **MooER (摩耳)**: an LLM-based speech recognition and translation model developed by Moore Threads. With the *MooER* framework, you can transcribe speech into text (automatic speech recognition, ASR) and translate it into other languages (automatic speech translation, AST) in an end-to-end manner. The performance of *MooER* is demonstrated in the Evaluation Results section below; our insights into model configurations, training strategies, and more are provided in our [technical report](https://arxiv.org/abs/2408.05101).

For the usage of the model files, please refer to our [GitHub](https://github.com/MooreThreads/MooER).

<br>
<p align="center">
<img src="assets/framework.png" width="600"/>
</p>
<br>

## 🔥 Evaluation Results

We present the training data and the evaluation results below. For more comprehensive information, please refer to our [report](https://arxiv.org/pdf/2408.05101).

### Training data

We utilized 5,000 hours of data (MT5K) to train our basic *MooER-5K* model. The data sources include:

| Dataset       | Duration |
|---------------|----------|
| aishell2      | 137h     |
| librispeech   | 131h     |
| multi_cn      | 100h     |
| wenetspeech   | 1361h    |
| in-house data | 3274h    |

Note that the data from the open-source datasets were randomly selected from the corresponding full training sets. The in-house data, collected internally without transcriptions, were transcribed using a third-party ASR service.

Since all the above datasets were originally designed only for the speech recognition task, no translation labels are available. To train our speech translation model, we used a third-party machine translation service to generate pseudo-labels. No data filtering techniques were applied.

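To make this pseudo-labeling step concrete, here is a minimal sketch that attaches a machine-translated English target to each (audio, transcript) entry. The JSONL manifest layout and the `third_party_translate` stub are assumptions made for illustration; they are not part of the MooER codebase.

```python
import json

def third_party_translate(text: str, src: str = "zh", tgt: str = "en") -> str:
    # Stand-in for the external MT service (hypothetical); replace with a real API call.
    return f"<{tgt} translation of: {text}>"

def build_ast_manifest(asr_manifest: str, ast_manifest: str) -> None:
    # Each input line is assumed to hold {"audio": ..., "text": ...}; the output adds
    # a "translation" pseudo-label. No filtering is applied, matching the report.
    with open(asr_manifest, encoding="utf-8") as fin, \
         open(ast_manifest, "w", encoding="utf-8") as fout:
        for line in fin:
            item = json.loads(line)
            item["translation"] = third_party_translate(item["text"])
            fout.write(json.dumps(item, ensure_ascii=False) + "\n")
```
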
At this moment, we are also developing a new model trained with 80K hours of data.

### Speech Recognition

The performance of speech recognition is evaluated using WER/CER (word/character error rate; lower is better).

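For reference, CER is the character-level edit distance between a hypothesis and its reference, normalized by the reference length (WER is the same computation over words). Below is a minimal self-contained sketch of the metric, not the scoring script behind the numbers in the table:

```python
def edit_distance(ref, hyp) -> int:
    # Levenshtein distance between two token sequences, single-row DP.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Chinese CER is computed over characters; spaces are ignored here.
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    return edit_distance(ref, hyp) / len(ref)

# One substituted character out of eight -> 12.50%
print(f"{100 * cer('摩尔线程语音识别', '摩耳线程语音识别'):.2f}%")
```
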
<table>
<tr>
<th>Language</th>
<th>Testset</th>
<th>Paraformer-large</th>
<th>SenseVoice-small</th>
<th>Qwen-audio</th>
<th>Whisper-large-v3</th>
<th>SeamlessM4T-v2</th>
<th>MooER-5K</th>
<th>MooER-80K</th>
<th>MooER-80K-v2</th>
</tr>
<tr>
<td rowspan="7">Chinese</td>
<td>aishell1</td>
<td>1.93</td>
<td>3.03</td>
<td>1.43</td>
<td>7.86</td>
<td>4.09</td>
<td>1.93</td>
<td>1.25</td>
<td>1.00</td>
</tr>
<tr>
<td>aishell2_ios</td>
<td>2.85</td>
<td>3.79</td>
<td>3.57</td>
<td>5.38</td>
<td>4.81</td>
<td>3.17</td>
<td>2.67</td>
<td>2.62</td>
</tr>
<tr>
<td>test_magicdata</td>
<td>3.66</td>
<td>3.81</td>
<td>5.31</td>
<td>8.36</td>
<td>9.69</td>
<td>3.48</td>
<td>2.52</td>
<td>2.17</td>
</tr>
<tr>
<td>test_thchs</td>
<td>3.99</td>
<td>5.17</td>
<td>4.86</td>
<td>9.06</td>
<td>7.14</td>
<td>4.11</td>
<td>3.14</td>
<td>3.00</td>
</tr>
<tr>
<td>fleurs cmn_dev</td>
<td>5.56</td>
<td>6.39</td>
<td>10.54</td>
<td>4.54</td>
<td>7.12</td>
<td>5.81</td>
<td>5.23</td>
<td>5.15</td>
</tr>
<tr>
<td>fleurs cmn_test</td>
<td>6.92</td>
<td>7.36</td>
<td>11.07</td>
<td>5.24</td>
<td>7.66</td>
<td>6.77</td>
<td>6.18</td>
<td>6.14</td>
</tr>
<tr>
<td>average</td>
<td><strong>4.15</strong></td>
<td><strong>4.93</strong></td>
<td><strong>6.13</strong></td>
<td><strong>6.74</strong></td>
<td><strong>6.75</strong></td>
<td><strong>4.21</strong></td>
<td><strong>3.50</strong></td>
<td><strong>3.35</strong></td>
</tr>
<tr>
<td rowspan="7">English</td>
<td>librispeech test_clean</td>
<td>14.15</td>
<td>4.07</td>
<td>2.15</td>
<td>3.42</td>
<td>2.77</td>
<td>7.78</td>
<td>4.11</td>
<td>3.57</td>
</tr>
<tr>
<td>librispeech test_other</td>
<td>22.99</td>
<td>8.26</td>
<td>4.68</td>
<td>5.62</td>
<td>5.25</td>
<td>15.25</td>
<td>9.99</td>
<td>9.09</td>
</tr>
<tr>
<td>fleurs eng_dev</td>
<td>24.93</td>
<td>12.92</td>
<td>22.53</td>
<td>11.63</td>
<td>11.36</td>
<td>18.89</td>
<td>13.32</td>
<td>13.12</td>
</tr>
<tr>
<td>fleurs eng_test</td>
<td>26.81</td>
<td>13.41</td>
<td>22.51</td>
<td>12.57</td>
<td>11.82</td>
<td>20.41</td>
<td>14.97</td>
<td>14.74</td>
</tr>
<tr>
<td>gigaspeech dev</td>
<td>24.23</td>
<td>19.44</td>
<td>12.96</td>
<td>19.18</td>
<td>28.01</td>
<td>23.46</td>
<td>16.92</td>
<td>17.34</td>
</tr>
<tr>
<td>gigaspeech test</td>
<td>23.07</td>
<td>16.65</td>
<td>13.26</td>
<td>22.34</td>
<td>28.65</td>
<td>22.09</td>
<td>16.64</td>
<td>16.97</td>
</tr>
<tr>
<td>average</td>
<td><strong>22.70</strong></td>
<td><strong>12.46</strong></td>
<td><strong>13.02</strong></td>
<td><strong>12.46</strong></td>
<td><strong>14.64</strong></td>
<td><strong>17.98</strong></td>
<td><strong>12.66</strong></td>
<td><strong>12.47</strong></td>
</tr>
</table>

### Speech Translation (zh -> en)

For speech translation, the performance is evaluated using the BLEU score (higher is better).

| Testset | Speech-LLaMA | Whisper-large-v3 | Qwen-audio | Qwen2-audio | SeamlessM4T-v2 | MooER-5K | MooER-5K-MTL |
|---------|--------------|------------------|------------|-------------|----------------|----------|--------------|
| CoVoST1 zh2en | - | 13.5 | 13.5 | - | 25.3 | - | **30.2** |
| CoVoST2 zh2en | 12.3 | 12.2 | 15.7 | 24.4 | 22.2 | 23.4 | **25.2** |
| CCMT2019 dev | - | 15.9 | 12.0 | - | 14.8 | - | **19.6** |

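The BLEU scores above come from the report; as a reference for scoring your own outputs, the sketch below uses the sacrebleu package. That sacrebleu's default settings match the exact scorer used in the report is an assumption, so treat the snippet as illustrative:

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["the weather in Beijing is nice today"]         # model outputs (English)
references = [["today the weather is very nice in Beijing"]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```
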
## 🚀 Getting Started

Please visit our [GitHub](https://github.com/MooreThreads/MooER) for setup and usage instructions.

## 🧾 License

Please see the [LICENSE](LICENSE).

## 📝 Citation

If you find MooER useful for your research, please 🌟 this repo and cite our work using the following BibTeX:

```bibtex
@article{liang2024mooer,
  title   = {MooER: an LLM-based Speech Recognition and Translation Model from Moore Threads},
  author  = {Zhenlin Liang and Junhao Xu and Yi Liu and Yichao Hu and Jian Li and Yajun Zheng and Meng Cai and Hua Wang},
  journal = {arXiv preprint arXiv:2408.05101},
  url     = {https://arxiv.org/abs/2408.05101},
  year    = {2024}
}
```

## 📧 Contact

If you encounter any problems, feel free to create a discussion on GitHub.

Moore Threads Website: **https://www.mthreads.com/**

<br>
<p align="left">
<img src="assets/MTLogo.png" width="300"/>
</p>
<br>