Space: SayaSS (status: Runtime error)
SayaSS committed — Commit c17721a • 1 parent: c2dde5f
update

Browse files
- Eng_docs.md +0 -109
- app.py +7 -12
- data_utils.py +0 -142
- flask_api.py +0 -56
- inference/__pycache__/__init__.cpython-38.pyc +0 -0
- inference/__pycache__/infer_tool.cpython-38.pyc +0 -0
- inference/__pycache__/slicer.cpython-38.pyc +0 -0
- inference/infer_tool.py +62 -22
- modules/__pycache__/__init__.cpython-38.pyc +0 -0
- modules/__pycache__/attentions.cpython-38.pyc +0 -0
- modules/__pycache__/commons.cpython-38.pyc +0 -0
- modules/__pycache__/modules.cpython-38.pyc +0 -0
- preprocess_flist_config.py +0 -67
- preprocess_hubert_f0.py +0 -62
- resample.py +0 -48
- spec_gen.py +0 -22
- train.py +0 -297
- utils.py +3 -9
Eng_docs.md
DELETED
@@ -1,109 +0,0 @@
# SoftVC VITS Singing Voice Conversion

## Updates
> According to incomplete statistics, training with multiple speakers seems to lead to **worsened leakage of voice timbre**. Training models with more than 5 speakers is not recommended. The current suggestion is to train a single-speaker model if you want a voice timbre that is closer to the target.
> Fixed the issue with unwanted staccato, improving audio quality by a decent amount.\
> The 2.0 version has been moved to the 2.0 branch.\
> Version 3.0 uses the code structure of FreeVC, which is not compatible with older versions.\
> Compared to [DiffSVC](https://github.com/prophesier/diff-svc), DiffSVC performs much better when the training data is of extremely high quality, but this repository may perform better on lower-quality datasets. Additionally, this repository is much faster at inference than DiffSVC.

## Model Overview
A singing voice conversion (SVC) model: the SoftVC encoder extracts features from the input audio, which are fed into VITS together with the F0 in place of the original text input to achieve the voice conversion effect. Additionally, the vocoder is changed to [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to fix the issue with unwanted staccato.

## Notice
+ The current branch is the 32 kHz version, which requires less VRAM during inference, runs inference faster, and uses less disk space for datasets. The 32 kHz branch is therefore recommended.
+ If you want to train 48 kHz models, switch to the [main branch](https://github.com/innnky/so-vits-svc/tree/main).

## Required models
+ Soft VC HuBERT: [hubert-soft-0d54a1f4.pt](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt)
+ Place it under `hubert`.
+ Pretrained models [G_0.pth](https://huggingface.co/innnky/sovits_pretrained/resolve/main/G_0.pth) and [D_0.pth](https://huggingface.co/innnky/sovits_pretrained/resolve/main/D_0.pth)
+ Place them under `logs/32k`.
+ Pretrained models are required because, from experiments, training from scratch can be rather unpredictable, and starting from a pretrained model greatly improves training speed.
+ The pretrained model includes 云灏, 即霜, 辉宇·星AI, 派蒙, and 绫地宁宁, covering the common ranges of both male and female voices, so it can be considered a fairly universal pretrained model.
+ The pretrained model excludes the `optimizer speaker_embedding` section, so it is only usable for pretraining and cannot be used for inference.
```shell
# For simple downloading.
# hubert
wget -P hubert/ https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt
# G&D pretrained models
wget -P logs/32k/ https://huggingface.co/innnky/sovits_pretrained/resolve/main/G_0.pth
wget -P logs/32k/ https://huggingface.co/innnky/sovits_pretrained/resolve/main/D_0.pth
```

## Colab notebook script for dataset creation and training
[colab training notebook](https://colab.research.google.com/drive/1rCUOOVG7-XQlVZuWRAj5IpGrMM8t07pE?usp=sharing)

## Dataset preparation
All that is required is to put the data under the `dataset_raw` folder in the structure shown below.
```shell
dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
```

## Data pre-processing
1. Resample to 32 kHz

```shell
python resample.py
```
2. Automatically split the dataset into training, validation, and test sets, and automatically generate the configuration file.
```shell
python preprocess_flist_config.py
# Notice.
# The n_speakers value in the config will be set automatically according to the number of speakers in the dataset.
# To reserve room for speakers added later, n_speakers will be set to twice the actual number.
# If you want even more room for adding data, you can edit the n_speakers value in the config after running this step.
# This cannot be changed after training starts.
```
3. Generate hubert and F0 features.
```shell
python preprocess_hubert_f0.py
```
After running the steps above, the `dataset` folder will contain all the pre-processed data, and the `dataset_raw` folder can then be deleted.

## Training
```shell
python train.py -c configs/config.json -m 32k
```

## Inference

Use [inference_main.py](inference_main.py)
+ Edit `model_path` to point to your newest checkpoint.
+ Place the input audio under the `raw` folder.
+ Change `clean_names` to the output file name.
+ Use `trans` to set the pitch shift amount (in semitones).
+ Change `spk_list` to the speaker name.
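For orientation, here is a minimal sketch of the kind of edits the bullets above describe. The variable names `model_path`, `clean_names`, `trans`, and `spk_list` come from this document; the exact layout inside inference_main.py (and the `config_path` line) is an assumption, so adapt it to your checkout.
```python
# Hypothetical configuration block for inference_main.py (values are illustrative).
model_path = "logs/32k/G_10000.pth"   # newest checkpoint
config_path = "configs/config.json"   # assumed: the config used for training
clean_names = ["my_song"]             # file placed under raw/ (name without extension)
trans = [0]                           # pitch shift in semitones
spk_list = ["speaker0"]               # target speaker name from the dataset
```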
## Onnx Exporting
### **When exporting Onnx, please make sure you re-clone the whole repository!!!**
Use [onnx_export.py](onnx_export.py)
+ Create a new folder called `checkpoints`.
+ Create a project folder inside `checkpoints` with the desired name for your project; let's use `myproject` as an example. The folder structure looks like `./checkpoints/myproject`.
+ Rename your model to `model.pth` and your config file to `config.json`, then move them into the `myproject` folder (a scripted version of these folder steps is sketched at the end of this section).
+ Modify [onnx_export.py](onnx_export.py) where `path = "NyaruTaffy"`: change `NyaruTaffy` to your project name, so here it becomes `path = "myproject"`.
+ Run [onnx_export.py](onnx_export.py).
+ Once it finishes, a `model.onnx` will be generated in the `myproject` folder; that is the exported model.
+ Notice: if you want to export a 48K model, please follow the instructions below or use `model_onnx_48k.py` directly.
  + Open [model_onnx.py](model_onnx.py) and change `hps={"sampling_rate": 32000...}` to `hps={"sampling_rate": 48000}` in class `SynthesizerTrn`.
  + Open [nvSTFT](/vdecoder/hifigan/nvSTFT.py) and replace all `32000` with `48000`.
### Onnx Model UI Support
+ [MoeSS](https://github.com/NaruseMioShirakana/MoeSS)
+ All training functions and transformations are stripped from the exported model; only when they are all removed are you actually running pure Onnx.
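The folder preparation above can also be scripted; the sketch below is purely illustrative (the source checkpoint and config paths are assumptions).
```python
import os
import shutil

project = "myproject"
os.makedirs(os.path.join("checkpoints", project), exist_ok=True)
# Illustrative source paths; substitute your own trained model and config.
shutil.copy("logs/32k/G_30000.pth", os.path.join("checkpoints", project, "model.pth"))
shutil.copy("configs/config.json", os.path.join("checkpoints", project, "config.json"))
# Then edit onnx_export.py so that path = "myproject", and run it.
```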
## Gradio (WebUI)
Use [sovits_gradio.py](sovits_gradio.py) to run the Gradio WebUI
+ Create a new folder called `checkpoints`.
+ Create a project folder inside `checkpoints` with the desired name for your project; let's use `myproject` as an example. The folder structure looks like `./checkpoints/myproject`.
+ Rename your model to `model.pth` and your config file to `config.json`, then move them into the `myproject` folder.
+ Run [sovits_gradio.py](sovits_gradio.py).
app.py
CHANGED
@@ -31,20 +31,15 @@ def create_vc_fn(model, sid):
         if input_audio is None:
             return "You need to upload an audio", None
         sampling_rate, audio = input_audio
-        # print(audio.shape,sampling_rate)
         duration = audio.shape[0] / sampling_rate
-        if duration > …
-            return "Please upload an audio file that is less than …
+        if duration > 30 and limitation:
+            return "Please upload an audio file that is less than 30 seconds. If you need to generate a longer audio file, please use Colab.", None
         audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32)
         if len(audio.shape) > 1:
             audio = librosa.to_mono(audio.transpose(1, 0))
-        if sampling_rate != …
-            audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=…
-
-        soundfile.write(out_wav_path, audio, 16000, format="wav")
-        out_audio, out_sr = model.infer(sid, vc_transform, out_wav_path,
-                                        auto_predict_f0=auto_f0,
-                                        )
+        if sampling_rate != 44100:
+            audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=44100)
+        out_audio, out_sr = model.infer(sid, vc_transform, audio, auto_predict_f0=auto_f0)
         return "Success", (44100, out_audio.cpu().numpy())
     return vc_fn
 
@@ -64,11 +59,11 @@ if __name__ == '__main__':
         models.append((name, cover, create_vc_fn(model, name)))
     with gr.Blocks() as app:
         gr.Markdown(
-            "# <center> Sovits …
+            "# <center> Sovits Models\n"
             "## <center> The input audio should be clean and pure voice without background music.\n"
             "![visitor badge](https://visitor-badge.glitch.me/badge?page_id=sayashi.Sovits-Umamusume)\n\n"
             "[Open In Colab](https://colab.research.google.com/drive/1wfsBbMzmtLflOJeqc5ZnJiLY7L239hJW?usp=share_link)"
-            " …
+            " without queue and length limitation.\n\n"
            "[Original Repo](https://github.com/innnky/so-vits-svc/tree/4.0)"
         )
         with gr.Tabs():
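The substance of this change: the Gradio handler no longer writes a temporary 16 kHz wav to disk; it normalizes and resamples the uploaded clip to 44.1 kHz and passes the numpy array straight to `model.infer`. A minimal sketch of that new preprocessing path, pulled out into a hypothetical helper (the name `prepare_audio` is not in the repo):
```python
import numpy as np
import librosa

def prepare_audio(input_audio, limitation=True):
    """Sketch of the preprocessing now done inside create_vc_fn (assumed helper)."""
    sampling_rate, audio = input_audio                # gr.Audio yields (sr, int ndarray)
    if audio.shape[0] / sampling_rate > 30 and limitation:
        raise ValueError("clip longer than 30 s")
    audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32)   # int PCM -> float32
    if len(audio.shape) > 1:
        audio = librosa.to_mono(audio.transpose(1, 0))                # stereo -> mono
    if sampling_rate != 44100:
        audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=44100)
    return audio                                       # handed directly to model.infer(...)
```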
data_utils.py
DELETED
@@ -1,142 +0,0 @@
import time
import os
import random
import numpy as np
import torch
import torch.utils.data

import modules.commons as commons
import utils
from modules.mel_processing import spectrogram_torch, spec_to_mel_torch
from utils import load_wav_to_torch, load_filepaths_and_text

# import h5py


"""Multi speaker version"""


class TextAudioSpeakerLoader(torch.utils.data.Dataset):
    """
    1) loads audio, speaker_id, text pairs
    2) normalizes text and converts them to sequences of integers
    3) computes spectrograms from audio files.
    """

    def __init__(self, audiopaths, hparams):
        self.audiopaths = load_filepaths_and_text(audiopaths)
        self.max_wav_value = hparams.data.max_wav_value
        self.sampling_rate = hparams.data.sampling_rate
        self.filter_length = hparams.data.filter_length
        self.hop_length = hparams.data.hop_length
        self.win_length = hparams.data.win_length
        self.sampling_rate = hparams.data.sampling_rate
        self.use_sr = hparams.train.use_sr
        self.spec_len = hparams.train.max_speclen
        self.spk_map = hparams.spk

        random.seed(1234)
        random.shuffle(self.audiopaths)

    def get_audio(self, filename):
        filename = filename.replace("\\", "/")
        audio, sampling_rate = load_wav_to_torch(filename)
        if sampling_rate != self.sampling_rate:
            raise ValueError("{} SR doesn't match target {} SR".format(
                sampling_rate, self.sampling_rate))
        audio_norm = audio / self.max_wav_value
        audio_norm = audio_norm.unsqueeze(0)
        spec_filename = filename.replace(".wav", ".spec.pt")
        if os.path.exists(spec_filename):
            spec = torch.load(spec_filename)
        else:
            spec = spectrogram_torch(audio_norm, self.filter_length,
                                     self.sampling_rate, self.hop_length, self.win_length,
                                     center=False)
            spec = torch.squeeze(spec, 0)
            torch.save(spec, spec_filename)

        spk = filename.split("/")[-2]
        spk = torch.LongTensor([self.spk_map[spk]])

        f0 = np.load(filename + ".f0.npy")
        f0, uv = utils.interpolate_f0(f0)
        f0 = torch.FloatTensor(f0)
        uv = torch.FloatTensor(uv)

        c = torch.load(filename + ".soft.pt")
        c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[0])

        lmin = min(c.size(-1), spec.size(-1))
        assert abs(c.size(-1) - spec.size(-1)) < 3, (c.size(-1), spec.size(-1), f0.shape, filename)
        assert abs(audio_norm.shape[1] - lmin * self.hop_length) < 3 * self.hop_length
        spec, c, f0, uv = spec[:, :lmin], c[:, :lmin], f0[:lmin], uv[:lmin]
        audio_norm = audio_norm[:, :lmin * self.hop_length]
        if spec.shape[1] < 60:
            print("skip too short audio:", filename)
            return None
        if spec.shape[1] > 800:
            start = random.randint(0, spec.shape[1] - 800)
            end = start + 790
            spec, c, f0, uv = spec[:, start:end], c[:, start:end], f0[start:end], uv[start:end]
            audio_norm = audio_norm[:, start * self.hop_length:end * self.hop_length]

        return c, f0, spec, audio_norm, spk, uv

    def __getitem__(self, index):
        return self.get_audio(self.audiopaths[index][0])

    def __len__(self):
        return len(self.audiopaths)


class TextAudioCollate:

    def __call__(self, batch):
        batch = [b for b in batch if b is not None]

        input_lengths, ids_sorted_decreasing = torch.sort(
            torch.LongTensor([x[0].shape[1] for x in batch]),
            dim=0, descending=True)

        max_c_len = max([x[0].size(1) for x in batch])
        max_wav_len = max([x[3].size(1) for x in batch])

        lengths = torch.LongTensor(len(batch))

        c_padded = torch.FloatTensor(len(batch), batch[0][0].shape[0], max_c_len)
        f0_padded = torch.FloatTensor(len(batch), max_c_len)
        spec_padded = torch.FloatTensor(len(batch), batch[0][2].shape[0], max_c_len)
        wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len)
        spkids = torch.LongTensor(len(batch), 1)
        uv_padded = torch.FloatTensor(len(batch), max_c_len)

        c_padded.zero_()
        spec_padded.zero_()
        f0_padded.zero_()
        wav_padded.zero_()
        uv_padded.zero_()

        for i in range(len(ids_sorted_decreasing)):
            row = batch[ids_sorted_decreasing[i]]

            c = row[0]
            c_padded[i, :, :c.size(1)] = c
            lengths[i] = c.size(1)

            f0 = row[1]
            f0_padded[i, :f0.size(0)] = f0

            spec = row[2]
            spec_padded[i, :, :spec.size(1)] = spec

            wav = row[3]
            wav_padded[i, :, :wav.size(1)] = wav

            spkids[i, 0] = row[4]

            uv = row[5]
            uv_padded[i, :uv.size(0)] = uv

        return c_padded, f0_padded, spec_padded, wav_padded, spkids, lengths, uv_padded
flask_api.py
DELETED
@@ -1,56 +0,0 @@
import io
import logging

import soundfile
import torch
import torchaudio
from flask import Flask, request, send_file
from flask_cors import CORS

from inference.infer_tool import Svc, RealTimeVC

app = Flask(__name__)

CORS(app)

logging.getLogger('numba').setLevel(logging.WARNING)


@app.route("/voiceChangeModel", methods=["POST"])
def voice_change_model():
    request_form = request.form
    wave_file = request.files.get("sample", None)
    # Pitch-shift amount
    f_pitch_change = float(request_form.get("fPitchChange", 0))
    # Sampling rate required by the DAW
    daw_sample = int(float(request_form.get("sampleRate", 0)))
    speaker_id = int(float(request_form.get("sSpeakId", 0)))
    # Get the wav file from the HTTP request and convert it
    input_wav_path = io.BytesIO(wave_file.read())

    # Model inference
    if raw_infer:
        out_audio, out_sr = svc_model.infer(speaker_id, f_pitch_change, input_wav_path)
        tar_audio = torchaudio.functional.resample(out_audio, svc_model.target_sample, daw_sample)
    else:
        out_audio = svc.process(svc_model, speaker_id, f_pitch_change, input_wav_path)
        tar_audio = torchaudio.functional.resample(torch.from_numpy(out_audio), svc_model.target_sample, daw_sample)
    # Return the audio
    out_wav_path = io.BytesIO()
    soundfile.write(out_wav_path, tar_audio.cpu().numpy(), daw_sample, format="wav")
    out_wav_path.seek(0)
    return send_file(out_wav_path, download_name="temp.wav", as_attachment=True)


if __name__ == '__main__':
    # If True, synthesis uses direct slicing; if False, cross-fading is used.
    # Setting the VST plugin slice time to 0.3-0.5 s lowers latency; direct slicing can pop at the joins, while cross-fading slightly overlaps the audio.
    # Choose whichever trade-off you can accept, or set the VST max slice time to 1 s. Here it is set to True: higher latency but more stable audio quality.
    raw_infer = True
    # Each model corresponds to exactly one config file.
    model_name = "logs/32k/G_174000-Copy1.pth"
    config_name = "configs/config.json"
    svc_model = Svc(model_name, config_name)
    svc = RealTimeVC()
    # This port matches the VST plugin; changing it is not recommended.
    app.run(port=6842, host="0.0.0.0", debug=False, threaded=False)
inference/__pycache__/__init__.cpython-38.pyc
CHANGED
Binary files a/inference/__pycache__/__init__.cpython-38.pyc and b/inference/__pycache__/__init__.cpython-38.pyc differ
|
|
inference/__pycache__/infer_tool.cpython-38.pyc
CHANGED
Binary files a/inference/__pycache__/infer_tool.cpython-38.pyc and b/inference/__pycache__/infer_tool.cpython-38.pyc differ
|
|
inference/__pycache__/slicer.cpython-38.pyc
CHANGED
Binary files a/inference/__pycache__/slicer.cpython-38.pyc and b/inference/__pycache__/slicer.cpython-38.pyc differ
|
|
inference/infer_tool.py
CHANGED
@@ -92,6 +92,21 @@ def mkdir(paths: list):
         if not os.path.exists(path):
             os.mkdir(path)
 
+def pad_array(arr, target_length):
+    current_length = arr.shape[0]
+    if current_length >= target_length:
+        return arr
+    else:
+        pad_width = target_length - current_length
+        pad_left = pad_width // 2
+        pad_right = pad_width - pad_left
+        padded_arr = np.pad(arr, (pad_left, pad_right), 'constant', constant_values=(0, 0))
+        return padded_arr
+
+def split_list_by_n(list_collection, n, pre=0):
+    for i in range(0, len(list_collection), n):
+        yield list_collection[i-pre if i-pre>=0 else i: i + n]
+
 
 class Svc(object):
     def __init__(self, net_g_path, config_path,
 
@@ -127,10 +142,7 @@ class Svc(object):
 
 
 
-    def get_unit_f0(self, …
-
-        wav, sr = librosa.load(in_path, sr=self.target_sample)
-
+    def get_unit_f0(self, wav, tran, cluster_infer_ratio, speaker):
         f0 = utils.compute_f0_parselmouth(wav, sampling_rate=self.target_sample, hop_length=self.hop_size)
         f0, uv = utils.interpolate_f0(f0)
         f0 = torch.FloatTensor(f0)
 
@@ -139,26 +151,29 @@ class Svc(object):
         f0 = f0.unsqueeze(0).to(self.dev)
         uv = uv.unsqueeze(0).to(self.dev)
 
-        wav16k = librosa.resample(wav, orig_sr=…
+        wav16k = librosa.resample(wav, orig_sr=44100, target_sr=16000)
         wav16k = torch.from_numpy(wav16k).to(self.dev)
         c = utils.get_hubert_content(self.hubert_model, wav_16k_tensor=wav16k)
         c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[1])
 
         if cluster_infer_ratio !=0:
-            cluster_c = cluster.get_cluster_center_result(self.cluster_model, c.numpy().T, speaker).T
-            cluster_c = torch.FloatTensor(cluster_c)
+            cluster_c = cluster.get_cluster_center_result(self.cluster_model, c.cpu().numpy().T, speaker).T
+            cluster_c = torch.FloatTensor(cluster_c).to(self.dev)
             c = cluster_infer_ratio * cluster_c + (1 - cluster_infer_ratio) * c
 
         c = c.unsqueeze(0)
         return c, f0, uv
 
-    def infer(self, speaker, tran, …
+    def infer(self, speaker, tran, raw_wav,
               cluster_infer_ratio=0,
               auto_predict_f0=False,
               noice_scale=0.4):
-        speaker_id = self.spk2id …
+        speaker_id = self.spk2id.__dict__.get(speaker)
+        if not speaker_id and type(speaker) is int:
+            if len(self.spk2id.__dict__) >= speaker:
+                speaker_id = speaker
         sid = torch.LongTensor([int(speaker_id)]).to(self.dev).unsqueeze(0)
-        c, f0, uv = self.get_unit_f0(…
+        c, f0, uv = self.get_unit_f0(raw_wav, tran, cluster_infer_ratio, speaker)
         if "half" in self.net_g_path and torch.cuda.is_available():
             c = c.half()
         with torch.no_grad():
 
@@ -167,39 +182,64 @@ class Svc(object):
         use_time = time.time() - start
         print("vits use time:{}".format(use_time))
         return audio, audio.shape[-1]
 
+    def clear_empty(self):
+        # free GPU memory
+        torch.cuda.empty_cache()
 
-    def slice_inference(self,raw_audio_path, spk, tran, slice_db,cluster_infer_ratio, auto_predict_f0,noice_scale, pad_seconds=0.5):
+    def slice_inference(self,raw_audio_path, spk, tran, slice_db,cluster_infer_ratio, auto_predict_f0,noice_scale, pad_seconds=0.5, clip_seconds=0,lg_num=0,lgr_num =0.75):
         wav_path = raw_audio_path
         chunks = slicer.cut(wav_path, db_thresh=slice_db)
         audio_data, audio_sr = slicer.chunks2audio(wav_path, chunks)
-
+        per_size = int(clip_seconds*audio_sr)
+        lg_size = int(lg_num*audio_sr)
+        lg_size_r = int(lg_size*lgr_num)
+        lg_size_c_l = (lg_size-lg_size_r)//2
+        lg_size_c_r = lg_size-lg_size_r-lg_size_c_l
+        lg = np.linspace(0,1,lg_size_r) if lg_size!=0 else 0
+
         audio = []
         for (slice_tag, data) in audio_data:
             print(f'#=====segment start, {round(len(data) / audio_sr, 3)}s======')
             # padd
-            pad_len = int(audio_sr * pad_seconds)
-            data = np.concatenate([np.zeros([pad_len]), data, np.zeros([pad_len])])
             length = int(np.ceil(len(data) / audio_sr * self.target_sample))
-            raw_path = io.BytesIO()
-            soundfile.write(raw_path, data, audio_sr, format="wav")
-            raw_path.seek(0)
             if slice_tag:
                 print('jump empty segment')
                 _audio = np.zeros(length)
+                audio.extend(list(pad_array(_audio, length)))
+                continue
+            if per_size != 0:
+                datas = split_list_by_n(data, per_size,lg_size)
             else:
+                datas = [data]
+            for k,dat in enumerate(datas):
+                per_length = int(np.ceil(len(dat) / audio_sr * self.target_sample)) if clip_seconds!=0 else length
+                if clip_seconds!=0: print(f'###=====segment clip start, {round(len(dat) / audio_sr, 3)}s======')
+                # padd
+                pad_len = int(audio_sr * pad_seconds)
+                dat = np.concatenate([np.zeros([pad_len]), dat, np.zeros([pad_len])])
+                raw_path = io.BytesIO()
+                soundfile.write(raw_path, dat, audio_sr, format="wav")
+                raw_path.seek(0)
                 out_audio, out_sr = self.infer(spk, tran, raw_path,
                                                cluster_infer_ratio=cluster_infer_ratio,
                                                auto_predict_f0=auto_predict_f0,
                                                noice_scale=noice_scale
                                                )
                 _audio = out_audio.cpu().numpy()
-                …
-                …
-                …
-                …
+                pad_len = int(self.target_sample * pad_seconds)
+                _audio = _audio[pad_len:-pad_len]
+                _audio = pad_array(_audio, per_length)
+                if lg_size!=0 and k!=0:
+                    lg1 = audio[-(lg_size_r+lg_size_c_r):-lg_size_c_r] if lgr_num != 1 else audio[-lg_size:]
+                    lg2 = _audio[lg_size_c_l:lg_size_c_l+lg_size_r] if lgr_num != 1 else _audio[0:lg_size]
+                    lg_pre = lg1*(1-lg)+lg2*lg
+                    audio = audio[0:-(lg_size_r+lg_size_c_r)] if lgr_num != 1 else audio[0:-lg_size]
+                    audio.extend(lg_pre)
+                    _audio = _audio[lg_size_c_l+lg_size_r:] if lgr_num != 1 else _audio[lg_size:]
+                audio.extend(list(_audio))
         return np.array(audio)
 
-
 class RealTimeVC:
     def __init__(self):
         self.last_chunk = None
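The biggest functional change here is in `slice_inference`: each voiced slice can now be clipped into fixed-length chunks (`clip_seconds`), and consecutive chunks are stitched with a linear cross-fade built from `lg = np.linspace(0, 1, lg_size_r)`. A small self-contained sketch of that stitching idea (the helper name `crossfade_concat` and the toy arrays are illustrative, not repo code):
```python
import numpy as np

def crossfade_concat(prev_tail, next_head):
    """Linear cross-fade over an overlap region, as in slice_inference when lg_size != 0."""
    assert len(prev_tail) == len(next_head)
    w = np.linspace(0, 1, len(prev_tail))          # same role as `lg` in slice_inference
    return prev_tail * (1 - w) + next_head * w     # fade the old chunk out, the new one in

# Example: stitching two 5-sample chunks that overlap by 3 samples.
a = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
b = np.array([0.0, 0.0, 0.0, 0.0, 0.0])
stitched = np.concatenate([a[:-3], crossfade_concat(a[-3:], b[:3]), b[3:]])
print(stitched)   # [1.  1.  1.  0.5 0.  0.  0. ]
```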
modules/__pycache__/__init__.cpython-38.pyc
CHANGED
Binary files a/modules/__pycache__/__init__.cpython-38.pyc and b/modules/__pycache__/__init__.cpython-38.pyc differ
|
|
modules/__pycache__/attentions.cpython-38.pyc
CHANGED
Binary files a/modules/__pycache__/attentions.cpython-38.pyc and b/modules/__pycache__/attentions.cpython-38.pyc differ
|
|
modules/__pycache__/commons.cpython-38.pyc
CHANGED
Binary files a/modules/__pycache__/commons.cpython-38.pyc and b/modules/__pycache__/commons.cpython-38.pyc differ
|
|
modules/__pycache__/modules.cpython-38.pyc
CHANGED
Binary files a/modules/__pycache__/modules.cpython-38.pyc and b/modules/__pycache__/modules.cpython-38.pyc differ
|
|
preprocess_flist_config.py
DELETED
@@ -1,67 +0,0 @@
import os
import argparse
import re

from tqdm import tqdm
from random import shuffle
import json

config_template = json.load(open("configs/config.json"))

pattern = re.compile(r'^[\.a-zA-Z0-9_\/]+$')

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_list", type=str, default="./filelists/train.txt", help="path to train list")
    parser.add_argument("--val_list", type=str, default="./filelists/val.txt", help="path to val list")
    parser.add_argument("--test_list", type=str, default="./filelists/test.txt", help="path to test list")
    parser.add_argument("--source_dir", type=str, default="./dataset/44k", help="path to source dir")
    args = parser.parse_args()

    train = []
    val = []
    test = []
    idx = 0
    spk_dict = {}
    spk_id = 0
    for speaker in tqdm(os.listdir(args.source_dir)):
        spk_dict[speaker] = spk_id
        spk_id += 1
        wavs = ["/".join([args.source_dir, speaker, i]) for i in os.listdir(os.path.join(args.source_dir, speaker))]
        for wavpath in wavs:
            if not pattern.match(wavpath):
                print(f"warning: the file name {wavpath} contains characters other than letters, digits, underscores and slashes, which may (or may not) cause errors.")
        if len(wavs) < 10:
            print(f"warning: speaker {speaker} has fewer than 10 clips; please add more data.")
        wavs = [i for i in wavs if i.endswith("wav")]
        shuffle(wavs)
        train += wavs[2:-2]
        val += wavs[:2]
        test += wavs[-2:]

    shuffle(train)
    shuffle(val)
    shuffle(test)

    print("Writing", args.train_list)
    with open(args.train_list, "w") as f:
        for fname in tqdm(train):
            wavpath = fname
            f.write(wavpath + "\n")

    print("Writing", args.val_list)
    with open(args.val_list, "w") as f:
        for fname in tqdm(val):
            wavpath = fname
            f.write(wavpath + "\n")

    print("Writing", args.test_list)
    with open(args.test_list, "w") as f:
        for fname in tqdm(test):
            wavpath = fname
            f.write(wavpath + "\n")

    config_template["spk"] = spk_dict
    print("Writing configs/config.json")
    with open("configs/config.json", "w") as f:
        json.dump(config_template, f, indent=2)
preprocess_hubert_f0.py
DELETED
@@ -1,62 +0,0 @@
import math
import multiprocessing
import os
import argparse
from random import shuffle

import torch
from glob import glob
from tqdm import tqdm

import utils
import logging
logging.getLogger('numba').setLevel(logging.WARNING)
import librosa
import numpy as np

hps = utils.get_hparams_from_file("configs/config.json")
sampling_rate = hps.data.sampling_rate
hop_length = hps.data.hop_length


def process_one(filename, hmodel):
    # print(filename)
    wav, sr = librosa.load(filename, sr=sampling_rate)
    soft_path = filename + ".soft.pt"
    if not os.path.exists(soft_path):
        devive = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        wav16k = librosa.resample(wav, orig_sr=sampling_rate, target_sr=16000)
        wav16k = torch.from_numpy(wav16k).to(devive)
        c = utils.get_hubert_content(hmodel, wav_16k_tensor=wav16k)
        torch.save(c.cpu(), soft_path)
    f0_path = filename + ".f0.npy"
    if not os.path.exists(f0_path):
        f0 = utils.compute_f0_dio(wav, sampling_rate=sampling_rate, hop_length=hop_length)
        np.save(f0_path, f0)


def process_batch(filenames):
    print("Loading hubert for content...")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    hmodel = utils.get_hubert_model().to(device)
    print("Loaded hubert.")
    for filename in tqdm(filenames):
        process_one(filename, hmodel)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--in_dir", type=str, default="dataset/44k", help="path to input dir")

    args = parser.parse_args()
    filenames = glob(f'{args.in_dir}/*/*.wav', recursive=True)  # [:10]
    shuffle(filenames)
    multiprocessing.set_start_method('spawn')

    num_processes = 1
    chunk_size = int(math.ceil(len(filenames) / num_processes))
    chunks = [filenames[i:i + chunk_size] for i in range(0, len(filenames), chunk_size)]
    print([len(c) for c in chunks])
    processes = [multiprocessing.Process(target=process_batch, args=(chunk,)) for chunk in chunks]
    for p in processes:
        p.start()
resample.py
DELETED
@@ -1,48 +0,0 @@
import os
import argparse
import librosa
import numpy as np
from multiprocessing import Pool, cpu_count
from scipy.io import wavfile
from tqdm import tqdm


def process(item):
    spkdir, wav_name, args = item
    # speaker 's5', 'p280', 'p315' are excluded,
    speaker = spkdir.replace("\\", "/").split("/")[-1]
    wav_path = os.path.join(args.in_dir, speaker, wav_name)
    if os.path.exists(wav_path) and '.wav' in wav_path:
        os.makedirs(os.path.join(args.out_dir2, speaker), exist_ok=True)
        wav, sr = librosa.load(wav_path, None)
        wav, _ = librosa.effects.trim(wav, top_db=20)
        peak = np.abs(wav).max()
        if peak > 1.0:
            wav = 0.98 * wav / peak
        wav2 = librosa.resample(wav, orig_sr=sr, target_sr=args.sr2)
        wav2 /= max(wav2.max(), -wav2.min())
        save_name = wav_name
        save_path2 = os.path.join(args.out_dir2, speaker, save_name)
        wavfile.write(
            save_path2,
            args.sr2,
            (wav2 * np.iinfo(np.int16).max).astype(np.int16)
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--sr2", type=int, default=44100, help="sampling rate")
    parser.add_argument("--in_dir", type=str, default="./dataset_raw", help="path to source dir")
    parser.add_argument("--out_dir2", type=str, default="./dataset/44k", help="path to target dir")
    args = parser.parse_args()
    processs = cpu_count()-2 if cpu_count() >4 else 1
    pool = Pool(processes=processs)

    for speaker in os.listdir(args.in_dir):
        spk_dir = os.path.join(args.in_dir, speaker)
        if os.path.isdir(spk_dir):
            print(spk_dir)
            for _ in tqdm(pool.imap_unordered(process, [(spk_dir, i, args) for i in os.listdir(spk_dir) if i.endswith("wav")])):
                pass
spec_gen.py
DELETED
@@ -1,22 +0,0 @@
from data_utils import TextAudioSpeakerLoader
import json
from tqdm import tqdm

from utils import HParams

config_path = 'configs/config.json'
with open(config_path, "r") as f:
    data = f.read()
config = json.loads(data)
hps = HParams(**config)

train_dataset = TextAudioSpeakerLoader("filelists/train.txt", hps)
test_dataset = TextAudioSpeakerLoader("filelists/test.txt", hps)
eval_dataset = TextAudioSpeakerLoader("filelists/val.txt", hps)

for _ in tqdm(train_dataset):
    pass
for _ in tqdm(eval_dataset):
    pass
for _ in tqdm(test_dataset):
    pass
train.py
DELETED
@@ -1,297 +0,0 @@
import logging
logging.getLogger('matplotlib').setLevel(logging.WARNING)
import os
import json
import argparse
import itertools
import math
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import torch.multiprocessing as mp
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler

import modules.commons as commons
import utils
from data_utils import TextAudioSpeakerLoader, TextAudioCollate
from models import (
    SynthesizerTrn,
    MultiPeriodDiscriminator,
)
from modules.losses import (
    kl_loss,
    generator_loss, discriminator_loss, feature_loss
)

from modules.mel_processing import mel_spectrogram_torch, spec_to_mel_torch

torch.backends.cudnn.benchmark = True
global_step = 0


# os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'INFO'


def main():
    """Assume Single Node Multi GPUs Training Only"""
    assert torch.cuda.is_available(), "CPU training is not allowed."
    hps = utils.get_hparams()

    n_gpus = torch.cuda.device_count()
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = hps.train.port

    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))


def run(rank, n_gpus, hps):
    global global_step
    if rank == 0:
        logger = utils.get_logger(hps.model_dir)
        logger.info(hps)
        utils.check_git_hash(hps.model_dir)
        writer = SummaryWriter(log_dir=hps.model_dir)
        writer_eval = SummaryWriter(log_dir=os.path.join(hps.model_dir, "eval"))

    # for pytorch on win, backend use gloo
    dist.init_process_group(backend= 'gloo' if os.name == 'nt' else 'nccl', init_method='env://', world_size=n_gpus, rank=rank)
    torch.manual_seed(hps.train.seed)
    torch.cuda.set_device(rank)
    collate_fn = TextAudioCollate()
    train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps)
    train_loader = DataLoader(train_dataset, num_workers=8, shuffle=False, pin_memory=True,
                              batch_size=hps.train.batch_size, collate_fn=collate_fn)
    if rank == 0:
        eval_dataset = TextAudioSpeakerLoader(hps.data.validation_files, hps)
        eval_loader = DataLoader(eval_dataset, num_workers=1, shuffle=False,
                                 batch_size=1, pin_memory=False,
                                 drop_last=False, collate_fn=collate_fn)

    net_g = SynthesizerTrn(
        hps.data.filter_length // 2 + 1,
        hps.train.segment_size // hps.data.hop_length,
        **hps.model).cuda(rank)
    net_d = MultiPeriodDiscriminator(hps.model.use_spectral_norm).cuda(rank)
    optim_g = torch.optim.AdamW(
        net_g.parameters(),
        hps.train.learning_rate,
        betas=hps.train.betas,
        eps=hps.train.eps)
    optim_d = torch.optim.AdamW(
        net_d.parameters(),
        hps.train.learning_rate,
        betas=hps.train.betas,
        eps=hps.train.eps)
    net_g = DDP(net_g, device_ids=[rank])  # , find_unused_parameters=True)
    net_d = DDP(net_d, device_ids=[rank])

    skip_optimizer = True
    try:
        _, _, _, epoch_str = utils.load_checkpoint(utils.latest_checkpoint_path(hps.model_dir, "G_*.pth"), net_g,
                                                   optim_g, skip_optimizer)
        _, _, _, epoch_str = utils.load_checkpoint(utils.latest_checkpoint_path(hps.model_dir, "D_*.pth"), net_d,
                                                   optim_d, skip_optimizer)
        global_step = (epoch_str - 1) * len(train_loader)
    except:
        print("load old checkpoint failed...")
        epoch_str = 1
        global_step = 0
    if skip_optimizer:
        epoch_str = 1
        global_step = 0

    scheduler_g = torch.optim.lr_scheduler.ExponentialLR(optim_g, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)
    scheduler_d = torch.optim.lr_scheduler.ExponentialLR(optim_d, gamma=hps.train.lr_decay, last_epoch=epoch_str - 2)

    scaler = GradScaler(enabled=hps.train.fp16_run)

    for epoch in range(epoch_str, hps.train.epochs + 1):
        if rank == 0:
            train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
                               [train_loader, eval_loader], logger, [writer, writer_eval])
        else:
            train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler,
                               [train_loader, None], None, None)
        scheduler_g.step()
        scheduler_d.step()


def train_and_evaluate(rank, epoch, hps, nets, optims, schedulers, scaler, loaders, logger, writers):
    net_g, net_d = nets
    optim_g, optim_d = optims
    scheduler_g, scheduler_d = schedulers
    train_loader, eval_loader = loaders
    if writers is not None:
        writer, writer_eval = writers

    # train_loader.batch_sampler.set_epoch(epoch)
    global global_step

    net_g.train()
    net_d.train()
    for batch_idx, items in enumerate(train_loader):
        c, f0, spec, y, spk, lengths, uv = items
        g = spk.cuda(rank, non_blocking=True)
        spec, y = spec.cuda(rank, non_blocking=True), y.cuda(rank, non_blocking=True)
        c = c.cuda(rank, non_blocking=True)
        f0 = f0.cuda(rank, non_blocking=True)
        uv = uv.cuda(rank, non_blocking=True)
        lengths = lengths.cuda(rank, non_blocking=True)
        mel = spec_to_mel_torch(
            spec,
            hps.data.filter_length,
            hps.data.n_mel_channels,
            hps.data.sampling_rate,
            hps.data.mel_fmin,
            hps.data.mel_fmax)

        with autocast(enabled=hps.train.fp16_run):
            y_hat, ids_slice, z_mask, \
            (z, z_p, m_p, logs_p, m_q, logs_q), pred_lf0, norm_lf0, lf0 = net_g(c, f0, uv, spec, g=g, c_lengths=lengths,
                                                                                spec_lengths=lengths)

            y_mel = commons.slice_segments(mel, ids_slice, hps.train.segment_size // hps.data.hop_length)
            y_hat_mel = mel_spectrogram_torch(
                y_hat.squeeze(1),
                hps.data.filter_length,
                hps.data.n_mel_channels,
                hps.data.sampling_rate,
                hps.data.hop_length,
                hps.data.win_length,
                hps.data.mel_fmin,
                hps.data.mel_fmax
            )
            y = commons.slice_segments(y, ids_slice * hps.data.hop_length, hps.train.segment_size)  # slice

            # Discriminator
            y_d_hat_r, y_d_hat_g, _, _ = net_d(y, y_hat.detach())

            with autocast(enabled=False):
                loss_disc, losses_disc_r, losses_disc_g = discriminator_loss(y_d_hat_r, y_d_hat_g)
                loss_disc_all = loss_disc

        optim_d.zero_grad()
        scaler.scale(loss_disc_all).backward()
        scaler.unscale_(optim_d)
        grad_norm_d = commons.clip_grad_value_(net_d.parameters(), None)
        scaler.step(optim_d)

        with autocast(enabled=hps.train.fp16_run):
            # Generator
            y_d_hat_r, y_d_hat_g, fmap_r, fmap_g = net_d(y, y_hat)
            with autocast(enabled=False):
                loss_mel = F.l1_loss(y_mel, y_hat_mel) * hps.train.c_mel
                loss_kl = kl_loss(z_p, logs_q, m_p, logs_p, z_mask) * hps.train.c_kl
                loss_fm = feature_loss(fmap_r, fmap_g)
                loss_gen, losses_gen = generator_loss(y_d_hat_g)
                loss_lf0 = F.mse_loss(pred_lf0, lf0)
                loss_gen_all = loss_gen + loss_fm + loss_mel + loss_kl + loss_lf0
        optim_g.zero_grad()
        scaler.scale(loss_gen_all).backward()
        scaler.unscale_(optim_g)
        grad_norm_g = commons.clip_grad_value_(net_g.parameters(), None)
        scaler.step(optim_g)
        scaler.update()

        if rank == 0:
            if global_step % hps.train.log_interval == 0:
                lr = optim_g.param_groups[0]['lr']
                losses = [loss_disc, loss_gen, loss_fm, loss_mel, loss_kl]
                logger.info('Train Epoch: {} [{:.0f}%]'.format(
                    epoch,
                    100. * batch_idx / len(train_loader)))
                logger.info([x.item() for x in losses] + [global_step, lr])

                scalar_dict = {"loss/g/total": loss_gen_all, "loss/d/total": loss_disc_all, "learning_rate": lr,
                               "grad_norm_d": grad_norm_d, "grad_norm_g": grad_norm_g}
                scalar_dict.update({"loss/g/fm": loss_fm, "loss/g/mel": loss_mel, "loss/g/kl": loss_kl,
                                    "loss/g/lf0": loss_lf0})

                # scalar_dict.update({"loss/g/{}".format(i): v for i, v in enumerate(losses_gen)})
                # scalar_dict.update({"loss/d_r/{}".format(i): v for i, v in enumerate(losses_disc_r)})
                # scalar_dict.update({"loss/d_g/{}".format(i): v for i, v in enumerate(losses_disc_g)})
                image_dict = {
                    "slice/mel_org": utils.plot_spectrogram_to_numpy(y_mel[0].data.cpu().numpy()),
                    "slice/mel_gen": utils.plot_spectrogram_to_numpy(y_hat_mel[0].data.cpu().numpy()),
                    "all/mel": utils.plot_spectrogram_to_numpy(mel[0].data.cpu().numpy()),
                    "all/lf0": utils.plot_data_to_numpy(lf0[0, 0, :].cpu().numpy(),
                                                        pred_lf0[0, 0, :].detach().cpu().numpy()),
                    "all/norm_lf0": utils.plot_data_to_numpy(lf0[0, 0, :].cpu().numpy(),
                                                             norm_lf0[0, 0, :].detach().cpu().numpy())
                }

                utils.summarize(
                    writer=writer,
                    global_step=global_step,
                    images=image_dict,
                    scalars=scalar_dict
                )

            if global_step % hps.train.eval_interval == 0:
                evaluate(hps, net_g, eval_loader, writer_eval)
                utils.save_checkpoint(net_g, optim_g, hps.train.learning_rate, epoch,
                                      os.path.join(hps.model_dir, "G_{}.pth".format(global_step)), hps.train.eval_interval, global_step)
                utils.save_checkpoint(net_d, optim_d, hps.train.learning_rate, epoch,
                                      os.path.join(hps.model_dir, "D_{}.pth".format(global_step)), hps.train.eval_interval, global_step)
        global_step += 1

    if rank == 0:
        logger.info('====> Epoch: {}'.format(epoch))


def evaluate(hps, generator, eval_loader, writer_eval):
    generator.eval()
    image_dict = {}
    audio_dict = {}
    with torch.no_grad():
        for batch_idx, items in enumerate(eval_loader):
            c, f0, spec, y, spk, _, uv = items
            g = spk[:1].cuda(0)
            spec, y = spec[:1].cuda(0), y[:1].cuda(0)
            c = c[:1].cuda(0)
            f0 = f0[:1].cuda(0)
            uv = uv[:1].cuda(0)
            mel = spec_to_mel_torch(
                spec,
                hps.data.filter_length,
                hps.data.n_mel_channels,
                hps.data.sampling_rate,
                hps.data.mel_fmin,
                hps.data.mel_fmax)
            y_hat = generator.module.infer(c, f0, uv, g=g)

            y_hat_mel = mel_spectrogram_torch(
                y_hat.squeeze(1).float(),
                hps.data.filter_length,
                hps.data.n_mel_channels,
                hps.data.sampling_rate,
                hps.data.hop_length,
                hps.data.win_length,
                hps.data.mel_fmin,
                hps.data.mel_fmax
            )

            audio_dict.update({
                f"gen/audio_{batch_idx}": y_hat[0],
                f"gt/audio_{batch_idx}": y[0]
            })
            image_dict.update({
                f"gen/mel": utils.plot_spectrogram_to_numpy(y_hat_mel[0].cpu().numpy()),
                "gt/mel": utils.plot_spectrogram_to_numpy(mel[0].cpu().numpy())
            })
    utils.summarize(
        writer=writer_eval,
        global_step=global_step,
        images=image_dict,
        audios=audio_dict,
        audio_sampling_rate=hps.data.sampling_rate
    )
    generator.train()


if __name__ == "__main__":
    main()
utils.py
CHANGED
@@ -222,7 +222,7 @@ def load_checkpoint(checkpoint_path, model, optimizer=None, skip_optimizer=False):
     checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
     iteration = checkpoint_dict['iteration']
     learning_rate = checkpoint_dict['learning_rate']
-    if optimizer is not None and not skip_optimizer:
+    if optimizer is not None and not skip_optimizer and checkpoint_dict['optimizer'] is not None:
         optimizer.load_state_dict(checkpoint_dict['optimizer'])
     saved_state_dict = checkpoint_dict['model']
     if hasattr(model, 'module'):
 
@@ -250,7 +250,7 @@ def load_checkpoint(checkpoint_path, model, optimizer=None, skip_optimizer=False):
     return model, optimizer, learning_rate, iteration
 
 
-def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path…
+def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path):
     logger.info("Saving model and optimizer state at iteration {} to {}".format(
         iteration, checkpoint_path))
     if hasattr(model, 'module'):
 
@@ -261,14 +261,8 @@ def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoint_path,
                 'iteration': iteration,
                 'optimizer': optimizer.state_dict(),
                 'learning_rate': learning_rate}, checkpoint_path)
-    if current_step >= val_steps * 3:
-        to_del_ckptname = checkpoint_path.replace(str(current_step), str(current_step - val_steps * 3))
-        if os.path.exists(to_del_ckptname):
-            os.remove(to_del_ckptname)
-            print("Removing ", to_del_ckptname)
 
-
-def clean_checkpoints(path_to_models='logs/48k/', n_ckpts_to_keep=2, sort_by_time=True):
+def clean_checkpoints(path_to_models='logs/44k/', n_ckpts_to_keep=2, sort_by_time=True):
     """Freeing up space by deleting saved ckpts
 
     Arguments:
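Net effect of the utils.py change: `load_checkpoint` now tolerates checkpoints saved without optimizer state, `save_checkpoint` no longer deletes older checkpoints itself, and `clean_checkpoints` defaults to the `logs/44k/` layout. A hedged usage sketch of how the two could be combined after this commit (the helper and its argument names are illustrative, not repo code):
```python
import os
import utils  # repo module providing save_checkpoint / clean_checkpoints

def save_and_prune(net_g, optim_g, learning_rate, epoch, global_step, model_dir="logs/44k"):
    """Illustrative helper: save with the new 5-argument save_checkpoint signature,
    then reclaim disk space explicitly, since auto-deletion was removed."""
    ckpt_path = os.path.join(model_dir, "G_{}.pth".format(global_step))
    utils.save_checkpoint(net_g, optim_g, learning_rate, epoch, ckpt_path)
    utils.clean_checkpoints(path_to_models=model_dir + "/", n_ckpts_to_keep=2, sort_by_time=True)
```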