---
language:
- lg
inference: false
tags:
- Vocoder
- HiFIGAN
- text-to-speech
- TTS
- speech-synthesis
- speechbrain
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_16_1
---

# Vocoder with HiFIGAN trained on Common Voice Luganda

This repository provides all the necessary tools for using a [HiFIGAN](https://arxiv.org/abs/2010.05646) vocoder trained on Luganda speech from [Common Voice](https://commonvoice.mozilla.org/lg/datasets).

The pre-trained model takes a spectrogram as input and produces a waveform as output. Typically, a vocoder is used after a TTS model that converts input text into a spectrogram.

The sampling frequency is 22050 Hz.

**NOTES**
- This vocoder model is trained on a single speaker. Although it has some ability to generalize to different speakers, for better results we recommend using a multi-speaker vocoder such as [this model trained on LibriTTS at 16,000 Hz](https://huggingface.co/speechbrain/tts-hifigan-libritts-16kHz) or [this one trained on LibriTTS at 22,050 Hz](https://huggingface.co/speechbrain/tts-hifigan-libritts-22050Hz).
- If you specifically require a vocoder with a 16,000 Hz sampling rate, follow the first link above for a suitable option.

## Install SpeechBrain

```bash
pip install speechbrain
```

Please note that we encourage you to read our tutorials and learn more about
[SpeechBrain](https://speechbrain.github.io).

### Using the Vocoder

- *Basic Usage:*
```python
import torch
from speechbrain.inference.vocoders import HIFIGAN
hifi_gan = HIFIGAN.from_hparams(source="Nick256/tts-hifigan-commonvoice-single-female", savedir="pretrained_models/tts-hifigan-commonvoice-single-female")
mel_specs = torch.rand(2, 80, 298)
waveforms = hifi_gan.decode_batch(mel_specs)
```
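Since HiFiGAN upsamples each mel frame by the hop length, you can estimate the decoded waveform length directly from the spectrogram's time dimension. A quick sanity-check sketch, assuming the hop length of 256 samples used in the mel-spectrogram settings of the next example:

```python
# Estimate the decoded waveform length from the mel frame count.
# Assumption: the vocoder emits one hop (256 samples) of audio per mel frame.
hop_length = 256
n_frames = 298          # time dimension of the example mel_specs above
sample_rate = 22050

n_samples = hop_length * n_frames
duration_s = n_samples / sample_rate
print(n_samples)             # 76288
print(round(duration_s, 2))  # 3.46
```

So the random 298-frame batch above decodes to roughly 3.5 seconds of audio per item.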

- *Convert a Spectrogram into a Waveform:*

```python
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.lobes.models.FastSpeech2 import mel_spectogram

# Load a pretrained HIFIGAN Vocoder
hifi_gan = HIFIGAN.from_hparams(source="Nick256/tts-hifigan-commonvoice-single-female", savedir="pretrained_models/tts-hifigan-commonvoice-single-female")

# Load an audio file (an example file can be found in this repository)
# Ensure that the audio signal is sampled at 22050 Hz; refer to the link above for a 16 kHz vocoder.
signal, rate = torchaudio.load('Nick256/tts-hifigan-commonvoice-single-female/example.wav')

# Compute the mel spectrogram.
# IMPORTANT: Use these specific parameters to match the vocoder's training settings for optimal results.
spectrogram, _ = mel_spectogram(
    audio=signal.squeeze(),
    sample_rate=22050,
    hop_length=256,
    win_length=None,
    n_mels=80,
    n_fft=1024,
    f_min=0.0,
    f_max=8000.0,
    power=1,
    normalized=False,
    min_max_energy_norm=True,
    norm="slaney",
    mel_scale="slaney",
    compression=True
)

# Convert the spectrogram to a waveform
waveforms = hifi_gan.decode_batch(spectrogram)

# Save the reconstructed audio
torchaudio.save('waveform_reconstructed.wav', waveforms.squeeze(1), 22050)

# If everything is set up correctly, the original and reconstructed audio should be nearly indistinguishable.
# Keep in mind that this vocoder is trained for a single speaker; for multi-speaker options, refer to the links above.
```
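Going the other way, the number of mel frames the settings above produce for a clip can be estimated from its length and the hop length. A rough sketch (assumes about one frame per hop; the exact count depends on padding and centering):

```python
# Rough mel frame-count estimate for the mel settings above.
# Assumption: roughly one frame per hop of 256 samples.
sample_rate = 22050
hop_length = 256

clip_seconds = 5                         # a hypothetical 5-second recording
signal_len = clip_seconds * sample_rate  # 110250 samples
approx_frames = signal_len // hop_length
print(approx_frames)  # 430
```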
93
+
94
+ ### Using the Vocoder with the TTS
95
+ ```python
96
+ import torchaudio
97
+ from speechbrain.inference.TTS import Tacotron2
98
+ from speechbrain.inference.vocoders import HIFIGAN
99
+
100
+ # Intialize TTS (tacotron2) and Vocoder (HiFIGAN)
101
+ tacotron2 = Tacotron2.from_hparams(source="Nick256/tts-tacotron2-commonvoice-single-female", savedir="pretrained_models/tts-tacotron2-commonvoice-single-female")
102
+ hifi_gan = HIFIGAN.from_hparams(source="Nick256/tts-hifigan-commonvoice-single-female", savedir="pretrained_model/tts-hifigan-commonvoice-single-female")
103
+
104
+ # Running the TTS
105
+ mel_output, mel_length, alignment = tacotron2.encode_text("osiibye otya leero")
106
+
107
+ # Running Vocoder (spectrogram-to-waveform)
108
+ waveforms = hifi_gan.decode_batch(mel_output)
109
+
110
+ # Save the waverform
111
+ torchaudio.save('example_TTS.wav',waveforms.squeeze(1), 22050)
112
+ ```

### Inference on GPU
To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
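For example, a minimal sketch that falls back to CPU when no GPU is available (the loading call is shown as a comment, mirroring the snippets above):

```python
import torch

# Pick a device; use the GPU only when one is actually available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Then pass it through run_opts when loading, e.g.:
# hifi_gan = HIFIGAN.from_hparams(
#     source="Nick256/tts-hifigan-commonvoice-single-female",
#     savedir="pretrained_models/tts-hifigan-commonvoice-single-female",
#     run_opts={"device": device},
# )
print(device)
```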
116
+
117
+ ### Training
118
+ The model was trained with SpeechBrain.
119
+ To train it from scratch follow these steps:
120
+ 1. Clone SpeechBrain:
121
+ ```bash
122
+ git clone https://github.com/speechbrain/speechbrain/
123
+ ```
124
+ 2. Install it:
125
+ ```bash
126
+ cd speechbrain
127
+ pip install -r requirements.txt
128
+ pip install -e .
129
+ ```
130
+ 3. Run Training:
131
+ ```bash
132
+ cd recipes/LJSpeech/TTS/vocoder/hifi_gan/
133
+ python train.py hparams/train.yaml --data_folder /path/to/LJspeech
134
+ ```
135
+ You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/19sLwV7nAsnUuLkoTu5vafURA9Fo2WZgG?usp=sharing).