k-m-irfan committed
Commit
20935f9
1 Parent(s): 0a893bd

added readme file

Files changed (3)
  1. .gitignore +0 -1
  2. README.md +120 -0
  3. hifigan/README.md +105 -0
.gitignore CHANGED
@@ -1,5 +1,4 @@
 tts-hs-hifigan
-README.md
 test.py
 steps.txt
 __pycache__
README.md ADDED
@@ -0,0 +1,120 @@
# Fastspeech2_HS_Flask_API
Flask API implementation of the Text to Speech model developed by Speech Lab, IIT Madras.

Please refer to the original repository for more details on the models and inference.

The repository is large; some of the large files are uploaded using Git LFS.

Please install the latest Git LFS using the commands below before downloading the large model files.

```
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
```

The complete repository, along with the models, is uploaded to Hugging Face.

Clone the Hugging Face repo below:
```
git clone https://huggingface.co/k-m-irfan/Fastspeech2_HS_Flask_API
```
Or download the models from the original repository [add link] and arrange them in the folder structure shown below:

```
models
├── assamese
│   ├── female
│   └── male
├── bengali
│   ├── female
│   └── male
├── bodo
│   └── female
├── english
│   ├── female
│   └── male
├── gujarati
│   ├── female
│   └── male
├── hindi
│   ├── female
│   └── male
├── kannada
│   ├── female
│   └── male
├── malayalam
│   ├── female
│   └── male
├── manipuri
│   ├── female
│   └── male
├── marathi
│   ├── female
│   └── male
├── odia
│   ├── female
│   └── male
├── punjabi
│   ├── female
│   └── male
├── rajasthani
│   ├── female
│   └── male
├── tamil
│   ├── female
│   └── male
├── telugu
│   ├── female
│   └── male
└── urdu
    ├── female
    └── male
```
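
If you arrange the models manually, a quick check such as the sketch below (an illustrative script, not part of the repo; the expected folders are taken from the tree above) can confirm that every `models/<language>/<gender>` directory is present:
```
# Illustrative check: verify the models/<language>/<gender> layout shown above.
from pathlib import Path

EXPECTED = {
    "assamese": ["female", "male"], "bengali": ["female", "male"],
    "bodo": ["female"], "english": ["female", "male"],
    "gujarati": ["female", "male"], "hindi": ["female", "male"],
    "kannada": ["female", "male"], "malayalam": ["female", "male"],
    "manipuri": ["female", "male"], "marathi": ["female", "male"],
    "odia": ["female", "male"], "punjabi": ["female", "male"],
    "rajasthani": ["female", "male"], "tamil": ["female", "male"],
    "telugu": ["female", "male"], "urdu": ["female", "male"],
}

missing = [
    f"models/{lang}/{gender}"
    for lang, genders in EXPECTED.items()
    for gender in genders
    if not (Path("models") / lang / gender).is_dir()
]
print("All model folders present." if not missing else f"Missing: {missing}")
```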

### Installation

Create a virtual environment and activate it:
```
python3 -m venv tts-hs-hifigan
source tts-hs-hifigan/bin/activate
```
Install the requirements:
```
pip install -r requirements.txt
```
Check that the app runs correctly and look for any errors using one of the commands below:
```
python3 flask_app.py
# OR
gunicorn -w 2 -b 0.0.0.0:5000 flask_app:app --timeout 600
```
If it runs without any problems, run the start script to start the server.
```
bash start.sh
```
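
Once the server is running, you can send requests to it from any HTTP client. The route and field names below are assumptions for illustration only (check `flask_app.py` for the actual endpoint and parameters); this is a minimal sketch of a client that posts text and saves the returned audio:
```
# Hypothetical client sketch: the endpoint path and JSON field names are assumptions,
# not taken from flask_app.py. Adjust them to match the actual API.
import requests

resp = requests.post(
    "http://localhost:5000/tts",
    json={"text": "Hello, world", "language": "english", "gender": "female"},
    timeout=600,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)  # assumes the endpoint returns raw audio bytes
```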

### Citation
If you use this FastSpeech2 model in your research or work, please consider citing:


COPYRIGHT
2023, Speech Technology Consortium,
Bhashini, MeiTY and by Hema A Murthy & S Umesh,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
and
ELECTRICAL ENGINEERING,
IIT MADRAS. ALL RIGHTS RESERVED.



Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
hifigan/README.md ADDED
@@ -0,0 +1,105 @@
# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

### Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our [paper](https://arxiv.org/abs/2010.05646),
we proposed HiFi-GAN: a GAN-based model capable of generating high-fidelity speech efficiently.<br/>
We provide our implementation and pretrained models as open source in this repository.

**Abstract:**
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
Although such methods improve the sampling efficiency and memory usage,
their sample quality has not yet reached that of autoregressive and flow-based generative models.
In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling the periodic patterns of an audio signal is crucial for enhancing sample quality.
A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method
demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than
real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen
speakers and to end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times
faster than real-time on CPU with quality comparable to an autoregressive counterpart.

Visit our [demo website](https://jik876.github.io/hifi-gan-demo/) for audio samples.


## Pre-requisites
1. Python >= 3.6
2. Clone this repository.
3. Install Python requirements. Please refer to [requirements.txt](requirements.txt).
4. Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/),
and move all wav files to `LJSpeech-1.1/wavs`.


## Training
```
python train.py --config config_v1.json
```
To train the V2 or V3 generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.<br>
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.<br>
You can change the path by adding the `--checkpoint_path` option.

Validation loss during training with the V1 generator.<br>
![validation loss](./validation_loss.png)

## Pretrained Model
You can also use the pretrained models we provide.<br/>
[Download pretrained models](https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y?usp=sharing)<br/>
Details of each folder are as follows:

|Folder Name|Generator|Dataset|Fine-Tuned|
|------|---|---|---|
|LJ_V1|V1|LJSpeech|No|
|LJ_V2|V2|LJSpeech|No|
|LJ_V3|V3|LJSpeech|No|
|LJ_FT_T2_V1|V1|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V2|V2|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|LJ_FT_T2_V3|V3|LJSpeech|Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2))|
|VCTK_V1|V1|VCTK|No|
|VCTK_V2|V2|VCTK|No|
|VCTK_V3|V3|VCTK|No|
|UNIVERSAL_V1|V1|Universal|No|

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
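
To load one of these checkpoints programmatically rather than through `inference.py`, the sketch below shows the general pattern. It assumes the `Generator` class in `models.py` and `AttrDict` in `env.py` from the upstream HiFi-GAN code, and the file paths are placeholders for wherever you extracted a downloaded folder:
```
# Minimal loading sketch (assumptions: models.Generator and env.AttrDict exist as in
# the upstream HiFi-GAN repo; the paths below are placeholders).
import json
import torch

from env import AttrDict
from models import Generator

with open("LJ_V1/config.json") as f:      # config shipped alongside the checkpoint
    h = AttrDict(json.load(f))

generator = Generator(h)
state_dict = torch.load("LJ_V1/generator_v1", map_location="cpu")
generator.load_state_dict(state_dict["generator"])  # weights are stored under "generator"
generator.eval()
generator.remove_weight_norm()            # remove weight norm before inference
```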

## Fine-Tuning
1. Generate mel-spectrograms in numpy format using [Tacotron2](https://github.com/NVIDIA/tacotron2) with teacher-forcing (a minimal sketch follows this list).<br/>
The file name of each generated mel-spectrogram should match its audio file, and the extension should be `.npy`.<br/>
Example:
```
Audio File : LJ001-0001.wav
Mel-Spectrogram File : LJ001-0001.npy
```
2. Create the `ft_dataset` folder and copy the generated mel-spectrogram files into it.<br/>
3. Run the following command.
```
python train.py --fine_tuning True --config config_v1.json
```
For other command line options, please refer to the training section.
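
A minimal sketch of step 1, assuming a helper `generate_teacher_forced_mel` that wraps your Tacotron2 teacher-forcing inference (the helper is hypothetical and not part of this repository); it only illustrates the file-naming convention and the `.npy` format:
```
# Hypothetical sketch: write teacher-forced mel-spectrograms into ft_dataset/
# with names matching the wavs (LJ001-0001.wav -> LJ001-0001.npy).
from pathlib import Path

import numpy as np

def generate_teacher_forced_mel(wav_path):
    # Placeholder: run Tacotron2 with teacher-forcing on this utterance and
    # return the mel-spectrogram as a 2-D numpy array.
    raise NotImplementedError

wav_dir = Path("LJSpeech-1.1/wavs")
out_dir = Path("ft_dataset")
out_dir.mkdir(exist_ok=True)

for wav_path in sorted(wav_dir.glob("*.wav")):
    mel = generate_teacher_forced_mel(wav_path)
    np.save(out_dir / (wav_path.stem + ".npy"), mel)
```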


## Inference from wav file
1. Make a `test_files` directory and copy wav files into the directory.
2. Run the following command.
```
python inference.py --checkpoint_file [generator checkpoint file path]
```
Generated wav files are saved in `generated_files` by default.<br>
You can change the path by adding the `--output_dir` option.


## Inference for end-to-end speech synthesis
1. Make a `test_mel_files` directory and copy generated mel-spectrogram files into the directory.<br>
You can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2),
[Glow-TTS](https://github.com/jaywalnut310/glow-tts), and so forth.
2. Run the following command.
```
python inference_e2e.py --checkpoint_file [generator checkpoint file path]
```
Generated wav files are saved in `generated_files_from_mel` by default.<br>
You can change the path by adding the `--output_dir` option.
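
Before copying mel files into `test_mel_files`, a quick shape check like the snippet below can catch obviously malformed inputs; it assumes each `.npy` file holds a single 2-D mel array (mel bins x frames), as in the fine-tuning sketch above:
```
# Illustrative sanity check (assumption: each .npy holds a 2-D mel array, mel_bins x frames).
from pathlib import Path

import numpy as np

for mel_path in sorted(Path("test_mel_files").glob("*.npy")):
    mel = np.load(mel_path)
    if mel.ndim != 2:
        print(f"{mel_path.name}: unexpected shape {mel.shape}")
    else:
        print(f"{mel_path.name}: {mel.shape[0]} mel bins, {mel.shape[1]} frames")
```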


## Acknowledgements
We referred to [WaveGlow](https://github.com/NVIDIA/waveglow), [MelGAN](https://github.com/descriptinc/melgan-neurips)
and [Tacotron2](https://github.com/NVIDIA/tacotron2) to implement this.