k-m-irfan committed
Commit
20935f9
1 Parent(s): 0a893bd

added readme file

Files changed (3)
  1. .gitignore +0 -1
  2. README.md +120 -0
  3. hifigan/README.md +105 -0
.gitignore CHANGED
@@ -1,5 +1,4 @@
 tts-hs-hifigan
-README.md
 test.py
 steps.txt
 __pycache__
README.md ADDED
@@ -0,0 +1,120 @@
# Fastspeech2_HS_Flask_API
Flask API implementation of the Text to Speech model developed by Speech Lab, IIT Madras.

Please refer to the original repository for more details on the models and inference.

The repository is large; some of the large files are uploaded using Git LFS.

Please install the latest Git LFS using the commands below before downloading the large model files.

```
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
```

The complete repository, along with the models, is uploaded to Hugging Face.

Clone the Hugging Face repo below:
```
git clone https://huggingface.co/k-m-irfan/Fastspeech2_HS_Flask_API
```
Or download the models from the original repository [add link] and arrange them in the folder structure shown below:

```
models
├── assamese
│   ├── female
│   └── male
├── bengali
│   ├── female
│   └── male
├── bodo
│   └── female
├── english
│   ├── female
│   └── male
├── gujarati
│   ├── female
│   └── male
├── hindi
│   ├── female
│   └── male
├── kannada
│   ├── female
│   └── male
├── malayalam
│   ├── female
│   └── male
├── manipuri
│   ├── female
│   └── male
├── marathi
│   ├── female
│   └── male
├── odia
│   ├── female
│   └── male
├── punjabi
│   ├── female
│   └── male
├── rajasthani
│   ├── female
│   └── male
├── tamil
│   ├── female
│   └── male
├── telugu
│   ├── female
│   └── male
└── urdu
    ├── female
    └── male
```
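
If you arrange the models manually, a quick check such as the sketch below (an illustrative script, not part of the repo; the expected folders are taken from the tree above) can confirm that every `models/<language>/<gender>` directory is present:
```
# Illustrative check: verify the models/<language>/<gender> layout shown above.
from pathlib import Path

EXPECTED = {
    "assamese": ["female", "male"], "bengali": ["female", "male"],
    "bodo": ["female"], "english": ["female", "male"],
    "gujarati": ["female", "male"], "hindi": ["female", "male"],
    "kannada": ["female", "male"], "malayalam": ["female", "male"],
    "manipuri": ["female", "male"], "marathi": ["female", "male"],
    "odia": ["female", "male"], "punjabi": ["female", "male"],
    "rajasthani": ["female", "male"], "tamil": ["female", "male"],
    "telugu": ["female", "male"], "urdu": ["female", "male"],
}

missing = [
    f"models/{lang}/{gender}"
    for lang, genders in EXPECTED.items()
    for gender in genders
    if not (Path("models") / lang / gender).is_dir()
]
print("All model folders present." if not missing else f"Missing: {missing}")
```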

### Installation

Create a virtual environment and activate it:
```
python3 -m venv tts-hs-hifigan
source tts-hs-hifigan/bin/activate
```
Install the requirements:
```
pip install -r requirements.txt
```
Check that the app runs correctly and look for any errors using one of the commands below:
```
python3 flask_app.py
# OR
gunicorn -w 2 -b 0.0.0.0:5000 flask_app:app --timeout 600
```
If it runs without any problems, run the start script to start the server.
```
bash start.sh
```
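
Once the server is running, you can send requests to it from any HTTP client. The route and field names below are assumptions for illustration only (check `flask_app.py` for the actual endpoint and parameters); this is a minimal sketch of a client that posts text and saves the returned audio:
```
# Hypothetical client sketch: the endpoint path and JSON field names are assumptions,
# not taken from flask_app.py. Adjust them to match the actual API.
import requests

resp = requests.post(
    "http://localhost:5000/tts",
    json={"text": "Hello, world", "language": "english", "gender": "female"},
    timeout=600,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)  # assumes the endpoint returns raw audio bytes
```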

### Citation
If you use this FastSpeech2 model in your research or work, please consider citing:


COPYRIGHT
2023, Speech Technology Consortium,
Bhashini, MeiTY and by Hema A Murthy & S Umesh,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
and
ELECTRICAL ENGINEERING,
IIT MADRAS. ALL RIGHTS RESERVED.



Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
hifigan/README.md ADDED
@@ -0,0 +1,105 @@
# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

### Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our [paper](https://arxiv.org/abs/2010.05646),
we proposed HiFi-GAN: a GAN-based model capable of generating high-fidelity speech efficiently.<br/>
We provide our implementation and pretrained models as open source in this repository.

**Abstract:**
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
Although such methods improve the sampling efficiency and memory usage,
their sample quality has not yet reached that of autoregressive and flow-based generative models.
In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling the periodic patterns of an audio signal is crucial for enhancing sample quality.
A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method
demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than
real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen
speakers and to end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times
faster than real-time on CPU with quality comparable to an autoregressive counterpart.

Visit our [demo website](https://jik876.github.io/hifi-gan-demo/) for audio samples.


## Pre-requisites
1. Python >= 3.6
2. Clone this repository.
3. Install Python requirements. Please refer to [requirements.txt](requirements.txt).
4. Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/),
and move all wav files to `LJSpeech-1.1/wavs`.


## Training
```
python train.py --config config_v1.json
```
To train the V2 or V3 generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.<br>
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.<br>
You can change the path by adding the `--checkpoint_path` option.

Validation loss during training with the V1 generator.<br>
![validation loss](./validation_loss.png)

## Pretrained Model
You can also use the pretrained models we provide.<br/>
[Download pretrained models](https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y?usp=sharing)<br/>
Details of each folder are as follows:

|Folder Name|Generator|Dataset|Fine-Tuned|
|------|---|---|---|
|LJ_V1|V1|LJSpeech|No|
|LJ_V2|V2|LJSpeech|No|
|LJ_V3|V3|LJSpeech|No|
|LJ_FT_T2_V1|V1|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V2|V2|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V3|V3|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|VCTK_V1|V1|VCTK|No|
|VCTK_V2|V2|VCTK|No|
|VCTK_V3|V3|VCTK|No|
|UNIVERSAL_V1|V1|Universal|No|

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
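
To load one of these checkpoints programmatically rather than through `inference.py`, the sketch below shows the general pattern. It assumes the `Generator` class in `models.py` and `AttrDict` in `env.py` from the upstream HiFi-GAN code, and the file paths are placeholders for wherever you extracted a downloaded folder:
```
# Minimal loading sketch (assumptions: models.Generator and env.AttrDict exist as in
# the upstream HiFi-GAN repo; the paths below are placeholders).
import json
import torch

from env import AttrDict
from models import Generator

with open("LJ_V1/config.json") as f:      # config shipped alongside the checkpoint
    h = AttrDict(json.load(f))

generator = Generator(h)
state_dict = torch.load("LJ_V1/generator_v1", map_location="cpu")
generator.load_state_dict(state_dict["generator"])  # weights are stored under "generator"
generator.eval()
generator.remove_weight_norm()            # remove weight norm before inference
```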

## Fine-Tuning
1. Generate mel-spectrograms in numpy format using [Tacotron2](https://github.com/NVIDIA/tacotron2) with teacher-forcing (a minimal sketch follows this list).<br/>
The file name of each generated mel-spectrogram should match its audio file, and the extension should be `.npy`.<br/>
Example:
```
Audio File : LJ001-0001.wav
Mel-Spectrogram File : LJ001-0001.npy
```
2. Create the `ft_dataset` folder and copy the generated mel-spectrogram files into it.<br/>
3. Run the following command.
```
python train.py --fine_tuning True --config config_v1.json
```
For other command line options, please refer to the training section.
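
A minimal sketch of step 1, assuming a helper `generate_teacher_forced_mel` that wraps your Tacotron2 teacher-forcing inference (the helper is hypothetical and not part of this repository); it only illustrates the file-naming convention and the `.npy` format:
```
# Hypothetical sketch: write teacher-forced mel-spectrograms into ft_dataset/
# with names matching the wavs (LJ001-0001.wav -> LJ001-0001.npy).
from pathlib import Path

import numpy as np

def generate_teacher_forced_mel(wav_path):
    # Placeholder: run Tacotron2 with teacher-forcing on this utterance and
    # return the mel-spectrogram as a 2-D numpy array.
    raise NotImplementedError

wav_dir = Path("LJSpeech-1.1/wavs")
out_dir = Path("ft_dataset")
out_dir.mkdir(exist_ok=True)

for wav_path in sorted(wav_dir.glob("*.wav")):
    mel = generate_teacher_forced_mel(wav_path)
    np.save(out_dir / (wav_path.stem + ".npy"), mel)
```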


## Inference from wav file
1. Make a `test_files` directory and copy wav files into the directory.
2. Run the following command.
```
python inference.py --checkpoint_file [generator checkpoint file path]
```
Generated wav files are saved in `generated_files` by default.<br>
You can change the path by adding the `--output_dir` option.


## Inference for end-to-end speech synthesis
1. Make a `test_mel_files` directory and copy generated mel-spectrogram files into the directory.<br>
You can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2),
[Glow-TTS](https://github.com/jaywalnut310/glow-tts), and so forth.
2. Run the following command.
```
python inference_e2e.py --checkpoint_file [generator checkpoint file path]
```
Generated wav files are saved in `generated_files_from_mel` by default.<br>
You can change the path by adding the `--output_dir` option.
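
Before copying mel files into `test_mel_files`, a quick shape check like the snippet below can catch obviously malformed inputs; it assumes each `.npy` file holds a single 2-D mel array (mel bins x frames), as in the fine-tuning sketch above:
```
# Illustrative sanity check (assumption: each .npy holds a 2-D mel array, mel_bins x frames).
from pathlib import Path

import numpy as np

for mel_path in sorted(Path("test_mel_files").glob("*.npy")):
    mel = np.load(mel_path)
    if mel.ndim != 2:
        print(f"{mel_path.name}: unexpected shape {mel.shape}")
    else:
        print(f"{mel_path.name}: {mel.shape[0]} mel bins, {mel.shape[1]} frames")
```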


## Acknowledgements
We referred to [WaveGlow](https://github.com/NVIDIA/waveglow), [MelGAN](https://github.com/descriptinc/melgan-neurips)
and [Tacotron2](https://github.com/NVIDIA/tacotron2) to implement this.