---
license: afl-3.0
language:
- en
- ja
datasets:
- Kutsuya/SinonVoice
---
This model was trained using [so-vits-svc-fork](https://github.com/voicepaw/so-vits-svc-fork).

Examples:
=========
![Sample](http://sndup.net/rtgt)

Hardware used:
============
* CPU: AMD Ryzen 9 3900X
* RAM: 64 GB
* GPU: 3090 24GB

Acquiring the dataset
=====================
Software used: 
* `ultimatevocalremovergui`
* `Audacity`

The dataset I used for this model: [Dataset](https://huggingface.co/datasets/Kutsuya/SinonVoice/tree/main)

<h3>Step 1</h3>
Find videos, music, podcasts, or any other recordings that contain the voice you want to make a model of. <br>

<h3>Step 2</h3>
Cut out the parts of the videos/music you want to use for the dataset. The clearer the audio, the better: ideally there is no background noise at all. <b>Each file must be at most 10 seconds long!</b><br>
You can do this in Audacity or any other audio editor you are familiar with.<br>
For a decent model, you will need about 100 samples.
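
The 10-second limit is easy to miss when cutting many clips by hand. As a rough sketch (not part of the original workflow), the following uses Python's standard `wave` module to list clips that are too long. The function names are made up for illustration, and the `wave` module only reads uncompressed PCM WAV files, so other formats would need a different tool.

```python
import wave
from pathlib import Path

MAX_SECONDS = 10.0  # the limit stated in Step 2


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def find_long_clips(folder, max_seconds=MAX_SECONDS):
    """Return (path, duration) pairs for WAV files longer than max_seconds."""
    too_long = []
    for path in sorted(Path(folder).rglob("*.wav")):
        duration = wav_duration(path)
        if duration > max_seconds:
            too_long.append((path, duration))
    return too_long
```

Calling `find_long_clips("path/to/your/clips")` returns the clips that still need to be shortened; an empty list means every WAV file is within the limit.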

<h3>Step 3</h3>

If a sample has background noise (which it most likely will), remove it with `ultimatevocalremovergui`.

Removing background noises
-----------------------------------
<h3>Installing the requirements</h3>
<h3>Step 1</h3>

Install `ultimatevocalremovergui` with the following steps:<br>
* `git clone https://github.com/Anjok07/ultimatevocalremovergui`
* `cd ultimatevocalremovergui`
* `nano environment.yml`
* Fill it with the following text:

```
name: ultimatevocalremovergui
channels:
  - defaults
dependencies:
  - python=3.10
  - tk
  - pip
  - pip:
    - -r requirements.txt
```
* Save and exit by pressing `ctrl`+`x`, then `Y`, then `enter`
* `conda env create -f environment.yml`
* `conda activate ultimatevocalremovergui`
* `python UVR.py`

<h3>Step 2</h3>

The software will now start up (this might take a bit). It will look like this: ![UVR](https://i.imgur.com/UKv2V7J.png)
First, download a model:
* Click on the wrench icon next to the `Start Processing` button.
* At the top of the new window that opens, click on the tab called `Download Center`
* Select the radio button called `Demucs`
* Select `Demucs v4:  htdemucs_ft`
* Click the download button underneath this combobox

Now that the model is downloaded, we can remove the background noise from a voice sample:
* At the top, click the `Select Input` button
* Select your voice sample
* Now click on the `Select Output` button
* Select a directory where you want the processed file to appear
* <b>IMPORTANT!</b> The output path should follow this pattern: `dataset_raw/{speaker_id}/**/{wav_file}.{any_format}`, for example `dataset_raw/sinon/wav/sample1.wav`. The `dataset_raw` folder can be anywhere on your system
* Now under the text `CHOOSE PROCESS METHOD` select `Demucs`
* Make sure the model is selected under the text `CHOOSE DEMUCS MODEL`
* Click on `GPU Conversion` to speed up the process
* Now click on `Start Processing` and wait until it is done
* After it's done, navigate to the folder you set as output and listen to the result. If it sounds OK, you are done. If it doesn't, don't use this file in your dataset
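
Since training depends on the `dataset_raw/{speaker_id}/...` layout, it can help to check the folder once all samples are processed. The sketch below is only an illustration (the function name is hypothetical): it counts the files under each speaker folder so you can compare the result against the roughly 100 samples recommended earlier.

```python
from pathlib import Path


def count_clips_per_speaker(dataset_raw):
    """Count the files found under each dataset_raw/{speaker_id}/ folder."""
    counts = {}
    for speaker_dir in sorted(Path(dataset_raw).iterdir()):
        if not speaker_dir.is_dir():
            continue  # files placed directly in dataset_raw have no speaker id
        counts[speaker_dir.name] = sum(
            1 for f in speaker_dir.rglob("*") if f.is_file()
        )
    return counts
```

For the example path above, `count_clips_per_speaker("dataset_raw")` would return a dictionary with a single `sinon` key and the number of files below it.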

Training the model
=====================
Here is a quick explanation of how I trained this model.

Software used: 

* `so-vits-svc-fork` (The software to morph your voice)
* `qpwgraph` (this is used to reroute the output to another process like Discord or Telegram)

<h3>Step 1</h3>

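First, install qpwgraph (the command below uses the `paru` AUR helper on Arch Linux; use your distribution's package manager otherwise):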
* `paru qpwgraph`

Now, clone the so-vits-svc-fork repo:
* `git clone https://github.com/voicepaw/so-vits-svc-fork`

Then, cd into the repo:
* `cd so-vits-svc-fork`

Now, make a conda environment:
* `conda create -n so-vits-svc-fork python=3.10`

Now, activate the conda environment:
* `conda activate so-vits-svc-fork`

Now, install the requirements:
* `python -m pip install -U pip setuptools wheel`
* `pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118`
* `pip install -U so-vits-svc-fork` (This will install the package inside your conda environment, meaning you can run it anywhere on your system as long as you are in your conda environment)

<h3>Step 2</h3>

* Navigate to the parent directory of `dataset_raw`. For example, if your dataset is at `/mnt/Shark/Projects/Sinon-Voice/training/dataset_raw/sinon/wav/`, navigate to `/mnt/Shark/Projects/Sinon-Voice/training`. Then run the following commands:
* `svc pre-resample`
* `svc pre-config`
* `svc pre-hubert`
* `svc train`
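
The four commands have to run in this order, each one from the training directory. If you prefer to start them in one go, a small wrapper like the following could work. This is a sketch under the assumption that `svc` is on your `PATH` (that is, the conda environment is active); `run_steps` is a made-up helper name, not part of so-vits-svc-fork.

```python
import subprocess

# The four commands from Step 2, in the order they are listed.
STEPS = [
    ["svc", "pre-resample"],
    ["svc", "pre-config"],
    ["svc", "pre-hubert"],
    ["svc", "train"],
]


def run_steps(training_dir, steps=STEPS, run=subprocess.run):
    """Run each step inside training_dir, stopping at the first failure."""
    for step in steps:
        print("Running:", " ".join(step))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run on top of a failed earlier step.
        run(step, cwd=training_dir, check=True)
```

With the example path above, this would be called as `run_steps("/mnt/Shark/Projects/Sinon-Voice/training")`.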

Using the model
=====================
<h3>Step 1</h3>

Now, run the program:<br>
* `svcg`

On the right side of the application window that opens, set both the input device and the output device to default (ALSA).
![Example](https://i.imgur.com/ctxVKfT.png)

<h3>Step 2</h3>

* At the top, select your model and config files. These are located in your training folder at: `logs/44k/`
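
Training usually leaves several checkpoints in that folder, and you will normally want the most recent one. The sketch below picks it by step number. It assumes the generator checkpoints are named `G_<step>.pth`, which is an assumption about the file naming and not something stated above, so check your own `logs/44k/` folder first. `latest_checkpoint` is a hypothetical helper name.

```python
import re
from pathlib import Path


def latest_checkpoint(log_dir):
    """Return the G_<step>.pth file with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in Path(log_dir).glob("G_*.pth"):
        # Assumed naming scheme: G_<step>.pth, e.g. G_10000.pth
        match = re.fullmatch(r"G_(\d+)\.pth", path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The step number is compared as an integer, so `G_10000.pth` correctly ranks above `G_800.pth`, which a plain alphabetical sort would get wrong.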

<h3>Step 3</h3>

* You can now tweak some settings, for example the pitch (I recommend a value of 12 to begin with)
* Turn off Auto predict
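
To get a feel for what a pitch value means, assume the setting is measured in semitones (an assumption on my part, based on the usual convention for transposition). In equal temperament each semitone multiplies the frequency by 2^(1/12), so the conversion is a one-liner:

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)
```

Under that assumption, a value of 12 doubles the frequency (one octave up) and -12 halves it.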

<h3>Step 4</h3>

* After tweaking the settings to your liking, press the button called `Infer` at the very bottom to start the voice morph

<h3>Additional info</h3>

If nothing happens, check the terminal output for error messages and act accordingly.