added readme file

- .gitignore +0 -1
- README.md +120 -0
- hifigan/README.md +105 -0
.gitignore
CHANGED
@@ -1,5 +1,4 @@
 tts-hs-hifigan
-README.md
 test.py
 steps.txt
 __pycache__
README.md
ADDED
@@ -0,0 +1,120 @@
# Fastspeech2_HS_Flask_API
Flask API implementation of the Text-to-Speech model developed by Speech Lab, IIT Madras.

Please refer to the original repository for more details on the models and inference.

The repository is large; some of the large files are uploaded using git LFS.

Please install the latest git LFS with the commands below before proceeding to download the large model files.

```
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.python.sh | bash
sudo apt-get install git-lfs
git lfs install
```

The complete repository, along with the models, is uploaded to Hugging Face.

Clone the Hugging Face repo below:
```
git clone https://huggingface.co/k-m-irfan/Fastspeech2_HS_Flask_API
```
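If git LFS was not installed before cloning, the large model files may come down as small pointer files instead of the actual weights; fetching them explicitly from inside the cloned repo should resolve this:
```
cd Fastspeech2_HS_Flask_API
git lfs pull
```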
Or download the models from the original repository [add link] and arrange the folder structure in the given format:

```
models
├── assamese
│   ├── female
│   └── male
├── bengali
│   ├── female
│   └── male
├── bodo
│   └── female
├── english
│   ├── female
│   └── male
├── gujarati
│   ├── female
│   └── male
├── hindi
│   ├── female
│   └── male
├── kannada
│   ├── female
│   └── male
├── malayalam
│   ├── female
│   └── male
├── manipuri
│   ├── female
│   └── male
├── marathi
│   ├── female
│   └── male
├── odia
│   ├── female
│   └── male
├── punjabi
│   ├── female
│   └── male
├── rajasthani
│   ├── female
│   └── male
├── tamil
│   ├── female
│   └── male
├── telugu
│   ├── female
│   └── male
└── urdu
    ├── female
    └── male
```

Installation:

Create a virtual environment and activate it:
```
python3 -m venv tts-hs-hifigan
source tts-hs-hifigan/bin/activate
```
Install the requirements:
```
pip install -r requirements.txt
```
Check that the app runs correctly, and look for any errors, using one of the commands below:
```
python3 flask_app.py
# OR
gunicorn -w 2 -b 0.0.0.0:5000 flask_app:app --timeout 600
```
If it runs without any problem, run the start command to start the server:
```
bash start.sh
```
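Once the server is up, you can test it with a request from another terminal. The route and field names below are only an illustration; check `flask_app.py` for the actual endpoint and expected parameters:
```
# Hypothetical request; adjust the route and JSON fields to match flask_app.py
curl -X POST http://localhost:5000/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "language": "english", "gender": "female"}' \
  --output output.wav
```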

### Citation
If you use this Fastspeech2 model in your research or work, please consider citing:

"
COPYRIGHT
2023, Speech Technology Consortium,
Bhashini, MeitY and by Hema A Murthy & S Umesh,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
and
ELECTRICAL ENGINEERING,
IIT MADRAS. ALL RIGHTS RESERVED "


Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
hifigan/README.md
ADDED
@@ -0,0 +1,105 @@
# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

### Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our [paper](https://arxiv.org/abs/2010.05646),
we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.<br/>
We provide our implementation and pretrained models as open source in this repository.

**Abstract:**
Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
Although such methods improve the sampling efficiency and memory usage,
their sample quality has not yet reached that of autoregressive and flow-based generative models.
In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality.
A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method
demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than
real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen
speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times
faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our [demo website](https://jik876.github.io/hifi-gan-demo/) for audio samples.


## Pre-requisites
1. Python >= 3.6
2. Clone this repository.
3. Install python requirements. Please refer to [requirements.txt](requirements.txt).
4. Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/),
and move all wav files to `LJSpeech-1.1/wavs`.


## Training
```
python train.py --config config_v1.json
```
To train the V2 or V3 generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.<br>
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.<br>
You can change the path by adding the `--checkpoint_path` option.
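For example, to keep these checkpoints in a separate directory (the directory name below is only an illustration):
```
python train.py --config config_v1.json --checkpoint_path cp_hifigan_v1
```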

Validation loss during training with the V1 generator:<br>
![validation loss](./validation_loss.png)

## Pretrained Model
You can also use the pretrained models we provide.<br/>
[Download pretrained models](https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y?usp=sharing)<br/>
Details of each folder are as follows:

|Folder Name|Generator|Dataset|Fine-Tuned|
|------|---|---|---|
|LJ_V1|V1|LJSpeech|No|
|LJ_V2|V2|LJSpeech|No|
|LJ_V3|V3|LJSpeech|No|
|LJ_FT_T2_V1|V1|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V2|V2|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V3|V3|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|VCTK_V1|V1|VCTK|No|
|VCTK_V2|V2|VCTK|No|
|VCTK_V3|V3|VCTK|No|
|UNIVERSAL_V1|V1|Universal|No|

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.

## Fine-Tuning
1. Generate mel-spectrograms in numpy format using [Tacotron2](https://github.com/NVIDIA/tacotron2) with teacher-forcing.<br/>
The file name of each generated mel-spectrogram should match its audio file, with the extension `.npy`.<br/>
Example:
```
Audio File : LJ001-0001.wav
Mel-Spectrogram File : LJ001-0001.npy
```
2. Create an `ft_dataset` folder and copy the generated mel-spectrogram files into it (a quick check of this naming convention is sketched after this list).<br/>
3. Run the following command.
```
python train.py --fine_tuning True --config config_v1.json
```
For other command line options, please refer to the training section.
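As a rough sanity check of the one-to-one naming requirement above, something like the following (illustrative only, assuming the LJSpeech layout from the pre-requisites) lists any wav file without a matching mel:
```
# Illustrative check: every training wav should have a same-named .npy mel in ft_dataset/
for wav in LJSpeech-1.1/wavs/*.wav; do
  base=$(basename "$wav" .wav)
  [ -f "ft_dataset/$base.npy" ] || echo "missing mel-spectrogram for $base"
done
```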


## Inference from wav file
1. Make a `test_files` directory and copy wav files into the directory.
2. Run the following command.
```
python inference.py --checkpoint_file [generator checkpoint file path]
```
Generated wav files are saved in `generated_files` by default.<br>
You can change the path by adding the `--output_dir` option.
88 |
+
|
89 |
+
|
90 |
+
## Inference for end-to-end speech synthesis
|
91 |
+
1. Make `test_mel_files` directory and copy generated mel-spectrogram files into the directory.<br>
|
92 |
+
You can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2),
|
93 |
+
[Glow-TTS](https://github.com/jaywalnut310/glow-tts) and so forth.
|
94 |
+
2. Run the following command.
|
95 |
+
```
|
96 |
+
python inference_e2e.py --checkpoint_file [generator checkpoint file path]
|
97 |
+
```
|
98 |
+
Generated wav files are saved in `generated_files_from_mel` by default.<br>
|
99 |
+
You can change the path by adding `--output_dir` option.
|
100 |
+
|
101 |
+
|
102 |
+
## Acknowledgements
|
103 |
+
We referred to [WaveGlow](https://github.com/NVIDIA/waveglow), [MelGAN](https://github.com/descriptinc/melgan-neurips)
|
104 |
+
and [Tacotron2](https://github.com/NVIDIA/tacotron2) to implement this.
|
105 |
+
|