游雁 committed on
Commit 7904416
1 Parent(s): b0e99c8
Files changed (1)
  1. README.md +195 -0
README.md CHANGED
---
license: other
license_name: model-license
license_link: https://github.com/alibaba-damo-academy/FunASR
---

# FunASR: A Fundamental End-to-End Speech Recognition Toolkit

[![PyPI](https://img.shields.io/pypi/v/funasr)](https://pypi.org/project/funasr/)

<strong>FunASR</strong> aims to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers carry out research and production of speech recognition models more conveniently, and promotes the growth of the speech recognition ecosystem. ASR for Fun!

[**Highlights**](#highlights)
| [**News**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
| [**Installation**](#installation)
| [**Quick Start**](#quick-start)
| [**Runtime**](./runtime/readme.md)
| [**Model Zoo**](#model-zoo)
| [**Contact**](#contact)


<a name="highlights"></a>
## Highlights
- FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization and multi-talker ASR. FunASR provides convenient scripts and tutorials, and supports inference and fine-tuning of pre-trained models.
- We have released a large collection of academic and industrial pretrained models on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition) and [Hugging Face](https://huggingface.co/FunASR), which can be accessed through our [Model Zoo](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md). The representative [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), a non-autoregressive end-to-end speech recognition model, offers high accuracy, high efficiency, and convenient deployment, and supports the rapid construction of speech recognition services. For more details on service deployment, please refer to the [service deployment document](runtime/readme_cn.md).


<a name="whats-new"></a>
## What's New
- 2024/01/30: funasr-1.0 has been released ([docs](https://github.com/alibaba-damo-academy/FunASR/discussions/1319)).
- 2024/01/30: Emotion recognition models are now supported ([model link](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary)), modified from this [repo](https://github.com/ddlBoJack/emotion2vec).
- 2024/01/25: Offline File Transcription Service 4.2 and Offline File Transcription Service of English 1.3 released: optimized the VAD (voice activity detection) data processing method, significantly reducing peak memory usage, and fixed memory leaks. Real-time Transcription Service 1.7 released with client-side optimizations ([docs](runtime/readme.md)).
- 2024/01/09: The FunASR SDK for Windows version 2.0 has been released, featuring support for the offline file transcription service (CPU) of Mandarin 4.1, the offline file transcription service (CPU) of English 1.2, and the real-time transcription service (CPU) of Mandarin 1.6. For more details, please refer to the official documentation or release notes ([FunASR-Runtime-Windows](https://www.modelscope.cn/models/damo/funasr-runtime-win-cpu-x64/summary)).
- 2024/01/03: File Transcription Service 4.0 released: added support for 8k models, optimized timestamp mismatch issues, added sentence-level timestamps, improved the effectiveness of English word FST hotwords, supported automated configuration of thread parameters, and fixed known crashes and memory leaks ([docs](runtime/readme.md#file-transcription-service-mandarin-cpu)).
- 2024/01/03: Real-time Transcription Service 1.6 released: the 2pass-offline mode supports Ngram language model decoding and WFST hotwords, and known crashes and memory leaks have been fixed ([docs](runtime/readme.md#the-real-time-transcription-service-mandarin-cpu)).
- 2024/01/03: Fixed known crashes and memory leaks in the English file transcription service ([docs](runtime/readme.md#file-transcription-service-english-cpu)).
- 2023/12/04: The FunASR SDK for Windows version 1.0 has been released, featuring support for the offline file transcription service (CPU) of Mandarin, the offline file transcription service (CPU) of English, and the real-time transcription service (CPU) of Mandarin. For more details, please refer to the official documentation or release notes ([FunASR-Runtime-Windows](https://www.modelscope.cn/models/damo/funasr-runtime-win-cpu-x64/summary)).
- 2023/11/08: The offline file transcription service 3.0 (CPU) of Mandarin has been released, adding a large punctuation model, an Ngram language model, and WFST hotwords. For detailed information, please refer to the [docs](runtime#file-transcription-service-mandarin-cpu).
- 2023/10/17: The offline file transcription service (CPU) of English has been released. For more details, please refer to the [docs](runtime#file-transcription-service-english-cpu).
- 2023/10/13: [SlideSpeech](https://slidespeech.github.io/): a large-scale multi-modal audio-visual corpus with a significant amount of real-time synchronized slides.
- 2023/10/10: The combined ASR and speaker diarization pipeline [Paraformer-VAD-SPK](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr_vad_spk/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn/demo.py) is now released. Try the model to get recognition results with speaker information.
- 2023/10/07: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec): a fundamental, reproducible and integrable open-source toolkit for neural speech codecs.
- 2023/09/01: The offline file transcription service 2.0 (CPU) of Mandarin has been released, with added support for ffmpeg, timestamp, and hotword models. For more details, please refer to the [docs](runtime#file-transcription-service-mandarin-cpu).
- 2023/08/07: The real-time transcription service (CPU) of Mandarin has been released. For more details, please refer to the [docs](runtime#the-real-time-transcription-service-mandarin-cpu).
- 2023/07/17: BAT, a low-latency and low-memory-consumption RNN-T model, has been released. For more details, please refer to [BAT](egs/aishell/bat).
- 2023/06/26: The ASRU2023 Multi-Channel Multi-Party Meeting Transcription Challenge 2.0 has concluded and the results have been announced. For more details, please refer to [M2MeT2.0](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html).


<a name="installation"></a>
## Installation

```shell
pip3 install -U funasr
```
Or install from source code:
```shell
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```
Install ModelScope for the pretrained models (optional):

```shell
pip3 install -U modelscope
```
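
To confirm which installation is active (useful when mixing the pip release and the editable source install above), a quick check; this assumes the package exposes `__version__`, as recent releases do:

```python
import funasr

# Print the installed FunASR version to verify the environment picks up
# the build you expect (pip release vs. editable source checkout).
print(funasr.__version__)
```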

## Model Zoo
FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Below are some representative models; for more models, please refer to the [Model Zoo](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md).

(Note: 🤗 represents the Hugging Face model zoo link, ⭐ represents the ModelScope model zoo link)

| Model Name | Task Details | Training Data | Parameters |
|:---:|:---:|:---:|:---:|
| paraformer-zh <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) [🤗]() ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
| <nobr>paraformer-zh-online <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗]() )</nobr> | speech recognition, streaming | 60000 hours, Mandarin | 220M |
| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗]() ) | speech recognition, with timestamps, non-streaming | 50000 hours, English | 220M |
| conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗]() ) | speech recognition, non-streaming | 50000 hours, English | 220M |
| ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗]() ) | punctuation restoration | 100M, Mandarin and English | 1.1G |
| fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗]() ) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
| fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗]() ) | timestamp prediction | 5000 hours, Mandarin | 38M |
| cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗]() ) | speaker verification/diarization | 5000 hours | 7.2M |
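
The names in the first column are the model strings accepted by `AutoModel` in the Quick Start below. A minimal loading sketch (the pretrained weights are downloaded on first use; the revisions mirror the Quick Start examples):

```python
from funasr import AutoModel

# Load models from the table directly by name.
asr = AutoModel(model="paraformer-zh", model_revision="v2.0.4")   # Mandarin ASR, non-streaming
vad = AutoModel(model="fsmn-vad", model_revision="v2.0.4")        # voice activity detection
punc = AutoModel(model="ct-punc", model_revision="v2.0.4")        # punctuation restoration
```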
85
+
86
+
87
+
88
+
89
+ [//]: # ()
90
+ [//]: # (FunASR supports pre-trained or further fine-tuned models for deployment as a service. The CPU version of the Chinese offline file conversion service has been released, details can be found in [docs]&#40;funasr/runtime/docs/SDK_tutorial.md&#41;. More detailed information about service deployment can be found in the [deployment roadmap]&#40;funasr/runtime/readme_cn.md&#41;.)
91
+
92
+
93
+ <a name="quick-start"></a>
94
+ ## Quick Start
95
+
96
+ Below is a quick start tutorial. Test audio files ([Mandarin](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav), [English]()).

### Command-line usage

```shell
funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=asr_example_zh.wav
```

Note: the command line supports recognition of a single audio file as well as a file list in Kaldi-style wav.scp format (one `wav_id wav_path` pair per line); see the sketch below.
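
A sketch of the wav.scp variant; the file name `wav.scp` and the audio paths below are hypothetical placeholders:

```shell
# wav.scp: one "wav_id wav_path" pair per line (hypothetical paths)
cat > wav.scp <<EOF
utt_0001 /data/audio/meeting_part1.wav
utt_0002 /data/audio/meeting_part2.wav
EOF

# Pass the list file as input instead of a single wav
funasr +model=paraformer-zh +vad_model="fsmn-vad" +punc_model="ct-punc" +input=wav.scp
```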
105
+
106
+ ### Speech Recognition (Non-streaming)
107
+ ```python
108
+ from funasr import AutoModel
109
+ # paraformer-zh is a multi-functional asr model
110
+ # use vad, punc, spk or not as you need
111
+ model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
112
+ vad_model="fsmn-vad", vad_model_revision="v2.0.4",
113
+ punc_model="ct-punc-c", punc_model_revision="v2.0.4",
114
+ # spk_model="cam++", spk_model_revision="v2.0.2",
115
+ )
116
+ res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
117
+ batch_size_s=300,
118
+ hotword='魔搭')
119
+ print(res)
120
+ ```
121
+ Note: `model_hub`: represents the model repository, `ms` stands for selecting ModelScope download, `hf` stands for selecting Huggingface download.

### Speech Recognition (Streaming)
```python
import os

import soundfile

from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5]: 600 ms chunks; [0, 8, 4]: 480 ms chunks
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600 ms at 16 kHz

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)
```
Note: `chunk_size` is the configuration for streaming latency. `[0, 10, 5]` means that the real-time output granularity is `10*60=600ms` and the lookahead is `5*60=300ms`. Each inference call takes `600ms` of input (`16000*0.6=9600` sample points at 16 kHz) and outputs the corresponding text. For the last speech segment, `is_final=True` must be set to force output of the final word.
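
For reference, a short sketch of the lower-latency configuration mentioned in the comment above (`[0, 8, 4]`); the arithmetic mirrors the 600 ms case:

```python
# 480 ms configuration: each chunk covers 8 * 60 ms = 480 ms of audio,
# with 4 * 60 ms = 240 ms of lookahead.
chunk_size = [0, 8, 4]
chunk_stride = chunk_size[1] * 960  # 8 * 960 = 7680 samples at 16 kHz = 480 ms
```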

### Voice Activity Detection (Non-Streaming)
```python
from funasr import AutoModel

model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
wav_file = f"{model.model_path}/example/asr_example.wav"
res = model.generate(input=wav_file)
print(res)
```
### Voice Activity Detection (Streaming)
```python
import soundfile

from funasr import AutoModel

chunk_size = 200  # ms
model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
```
### Punctuation Restoration
```python
from funasr import AutoModel

model = AutoModel(model="ct-punc", model_revision="v2.0.4")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
```
### Timestamp Prediction
```python
from funasr import AutoModel

model = AutoModel(model="fa-zh", model_revision="v2.0.4")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
```

More examples can be found in the [docs](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining).