File size: 4,797 Bytes
2f23ad5
c3dedaf
 
42ff7f4
c3dedaf
 
 
 
 
 
 
2f23ad5
c3dedaf
 
2f23ad5
c3dedaf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
language:
- lg
inference: true
tags:
- Vocoder
- HiFIGAN
- text-to-speech
- TTS
- speech-synthesis
- speechbrain
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_16_1
---

# Vocoder with HiFIGAN trained on LJSpeech

This repository provides all the necessary tools for using a [HiFIGAN](https://arxiv.org/abs/2010.05646) vocoder trained with [LugandaSpeech](https://commonvoice.mozilla.org/lg/datasets). 

The pre-trained model takes in input a spectrogram and produces a waveform in output. Typically, a vocoder is used after a TTS model that converts an input text into a spectrogram.

The sampling frequency is 22050 Hz.

**NOTES**
- This vocoder model is trained on a single speaker. Although it has some ability to generalize to different speakers, for better results, we recommend using a multi-speaker vocoder like [this model trained on LibriTTS at 16,000 Hz](https://huggingface.co/speechbrain/tts-hifigan-libritts-16kHz) or [this one trained on LibriTTS at 22,050 Hz](https://huggingface.co/speechbrain/tts-hifigan-libritts-22050Hz).
- If you specifically require a vocoder with a 16,000 Hz sampling rate, please follow the provided link above for a suitable option.

## Install SpeechBrain

```bash
pip install speechbrain
```


Please notice that we encourage you to read our tutorials and learn more about
[SpeechBrain](https://speechbrain.github.io).

### Using the Vocoder

- *Basic Usage:*
```python
import torch
from speechbrain.inference.vocoders import HIFIGAN
hifi_gan = HIFIGAN.from_hparams(source="Nick256/tts-hifigan-commonvoice-single-female", savedir="pretrained_models/tts-hifigan-commonvoice-single-female")
mel_specs = torch.rand(2, 80,298)
waveforms = hifi_gan.decode_batch(mel_specs)
```

- *Convert a Spectrogram into a Waveform:*

```python
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN
from speechbrain.lobes.models.FastSpeech2 import mel_spectogram

# Load a pretrained HIFIGAN Vocoder
hifi_gan = HIFIGAN.from_hparams(source="Nick256/tts-hifigan-commonvoice-single-female", savedir="pretrained_models/tts-hifigan-commonvoice-single-female")

# Load an audio file (an example file can be found in this repository)
# Ensure that the audio signal is sampled at 22050 Hz; refer to the provided link for a 16 kHz Vocoder.
signal, rate = torchaudio.load('Nick256/tts-hifigan-commonvoice-single-female/example.wav')

# Compute the mel spectrogram.
# IMPORTANT: Use these specific parameters to match the Vocoder's training settings for optimal results.
spectrogram, _ = mel_spectogram(
    audio=signal.squeeze(),
    sample_rate=22050,
    hop_length=256,
    win_length=None,
    n_mels=80,
    n_fft=1024,
    f_min=0.0,
    f_max=8000.0,
    power=1,
    normalized=False,
    min_max_energy_norm=True,
    norm="slaney",
    mel_scale="slaney",
    compression=True
)

# Convert the spectrogram to waveform
waveforms = hifi_gan.decode_batch(spectrogram)

# Save the reconstructed audio as a waveform
torchaudio.save('waveform_reconstructed.wav', waveforms.squeeze(1), 22050)

# If everything is set up correctly, the original and reconstructed audio should be nearly indistinguishable.
# Keep in mind that this Vocoder is trained for a single speaker; for multi-speaker Vocoder options, refer to the provided links.

```

### Using the Vocoder with the TTS
```python
import torchaudio
from speechbrain.inference.TTS import Tacotron2
from speechbrain.inference.vocoders import HIFIGAN

# Intialize TTS (tacotron2) and Vocoder (HiFIGAN)
tacotron2 = Tacotron2.from_hparams(source="Nick256/tts-tacotron2-commonvoice-single-female", savedir="pretrained_models/tts-tacotron2-commonvoice-single-female")
hifi_gan = HIFIGAN.from_hparams(source="Nick256/tts-hifigan-commonvoice-single-female", savedir="pretrained_model/tts-hifigan-commonvoice-single-female")

# Running the TTS
mel_output, mel_length, alignment = tacotron2.encode_text("osiibye otya leero")

# Running Vocoder (spectrogram-to-waveform)
waveforms = hifi_gan.decode_batch(mel_output)

# Save the waverform
torchaudio.save('example_TTS.wav',waveforms.squeeze(1), 22050)
```

### Inference on GPU
To perform inference on the GPU, add  `run_opts={"device":"cuda"}`  when calling the `from_hparams` method.

### Training
The model was trained with SpeechBrain.
To train it from scratch follow these steps:
1. Clone SpeechBrain:
```bash
git clone https://github.com/speechbrain/speechbrain/
```
2. Install it:
```bash
cd speechbrain
pip install -r requirements.txt
pip install -e .
```
3. Run Training:
```bash
cd recipes/LJSpeech/TTS/vocoder/hifi_gan/
python train.py hparams/train.yaml --data_folder /path/to/LJspeech
```
You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/19sLwV7nAsnUuLkoTu5vafURA9Fo2WZgG?usp=sharing).