Commit ecaf68f by Changhan and elbayadm (1 parent: 363e672)

Update README.md (#12)

- Update README.md (8f88c5e3032c736a67d6ac696f21f3f25d2bc390)
- Update README.md (00acca2c94f2c6c8150d1cd1b858e6447b16102c)
- Update README.md (19f404348f18bbccdf8b9da95722d65cf6e27ddf)

Co-authored-by: Maha Elbayad <[email protected]>

Files changed (1): README.md (+156 −97)
---
license: cc-by-nc-4.0
language:
- af
- am
- ar
- as
- az
- be
- bn
- bs
- bg
- ca
- cs
- zh
- cy
- da
- de
- el
- en
- et
- fi
- fr
- or
- om
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- ig
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- mn
- km
- ky
- ko
- lo
- ln
- lt
- lb
- lg
- lv
- ml
- mr
- mk
- mt
- mi
- my
- nl
- nb
- ne
- ny
- oc
- pa
- ps
- fa
- pl
- pt
- ro
- ru
- sk
- sl
- sn
- sd
- so
- es
- sr
- sv
- sw
- ta
- te
- tg
- tl
- th
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- ms
- zu
- ary
- arz
- yue
- kea
metrics:
- bleu
- wer
- chrf
inference: false
pipeline_tag: automatic-speech-recognition
tags:
- audio-to-audio
- text-to-speech
- speech-to-text
- text2text-generation
- seamless_communication
library_name: fairseq2
---

# SeamlessM4T Medium

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different
linguistic communities to communicate effortlessly through speech and text.

-------------------

**🌟 SeamlessM4T v2, an improved version of this model with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large).**

**The new model improves over SeamlessM4T v1 in quality as well as in inference speed on speech generation tasks.**

**SeamlessM4T v2 is also supported by 🤗 Transformers; for more details, see [the model card of the new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or the [🤗 Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**

-------------------

This is the "medium" variant of SeamlessM4T, which enables multiple tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

## SeamlessM4T models

| Model Name | #params | checkpoint | metrics |
| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip) |
| [SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip) |
| [SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium) | 1.2B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |

Extensive evaluation results for the SeamlessM4T models are reported (as averages) in the [SeamlessM4T](https://arxiv.org/abs/2308.11596) and [Seamless](https://arxiv.org/abs/2312.05187) papers; the full numbers are available in the `metrics` files linked above.

## 🤗 Transformers Usage

First, load the processor and a checkpoint of the model:

```python
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
```
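
Inference is much faster on a GPU. A minimal sketch of moving the model over, using standard 🤗 Transformers/PyTorch idioms and assuming a CUDA device is available:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Any inputs passed to model.generate() below must be moved to the same device,
# e.g. text_inputs = text_inputs.to(device)
```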

You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.
 
Here is how to use the processor to process text and audio:

```python
# Read an audio file and resample it to 16 kHz:
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)  # the model expects 16 kHz audio
audio_inputs = processor(audios=audio, return_tensors="pt")

# Process some input text as well:
text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```

### Speech

Generate speech in Russian from either text (T2ST) or speech input (S2ST):

```python
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```
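
To listen to the result, you can write the waveform to disk. A minimal sketch with `scipy`, assuming the model exposes its output sampling rate as `model.config.sampling_rate` (as in recent 🤗 Transformers versions):

```python
import scipy.io.wavfile

sample_rate = model.config.sampling_rate  # assumed attribute; 16 kHz for the SeamlessM4T vocoder
scipy.io.wavfile.write("speech_from_text.wav", rate=sample_rate, data=audio_array_from_text)
```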

### Text

Similarly, you can generate translated text from audio files (S2TT) or from text (T2TT, i.e. conventional machine translation) with the same model.
You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate).

```python
# From audio (S2TT)
output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

# From text (T2TT)
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```
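
ASR follows the same pattern: it is equivalent to S2TT with the target language set to the source language. A sketch, assuming the English audio sample loaded above:

```python
# Transcribe English speech (ASR = S2TT with tgt_lang equal to the source language)
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
transcribed_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```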

## `seamless_communication` usage

You can also run the SeamlessM4T models with the [`seamless_communication` library](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/README.md), either through the CLI:

```bash
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_medium
```
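
Text input works the same way from the CLI. A sketch assuming T2TT accepts the same flag style as the S2ST call above (check `m4t_predict --help` for the exact options of your installed version):

```bash
# assumption: t2tt uses the same flags as s2st, plus --src_lang for the input language
m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang> --model_name seamlessM4T_medium
```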

or through the `Translator` API:

```python
import torch
from seamless_communication.inference import Translator

# Initialize a Translator object with a multitask model and a vocoder, on the GPU in fp16.
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

# text_generation_opts and unit_generation_opts configure decoding for text and speech units;
# see the seamless_communication docs for how to construct them.
text_output, speech_output = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2ST",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)
```
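
The returned `speech_output` holds the generated waveforms. A sketch of saving the first one with `torchaudio`, assuming the batched output exposes `audio_wavs` and `sample_rate` attributes as in recent versions of the library:

```python
import torchaudio

# assumption: speech_output.audio_wavs is a list of waveform tensors,
# and speech_output.sample_rate is the vocoder output rate
torchaudio.save(
    <path_to_save_audio>,
    speech_output.audio_wavs[0][0].to(torch.float32).cpu(),
    sample_rate=speech_output.sample_rate,
)
```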

## Citation

If you plan to use SeamlessM4T in your work, or any models/datasets/artifacts published with SeamlessM4T, please cite: