---
license: apache-2.0
---

This model was trained on Google's AudioSet (28 GB of data) for 1 million steps. (The original plan was 2 million steps, but I'm exploring a better training schedule.)

You can regard it as a pretrained base model, a concept that is common for language models but not for vocoders.

How to load and use this model:
```python
import torch
import torchaudio
from scipy.io.wavfile import write

from vocos import Vocos


def safe_log(x: torch.Tensor, clip_val: float = 1e-7):
    # Clamp before taking the log to avoid -inf on silent frames
    return torch.log(torch.clip(x, min=clip_val))


# Load the training checkpoint and build the inference model from the config.
# strict=False because the Lightning checkpoint also stores training-only weights
# (e.g. the GAN discriminators) that the inference-time Vocos module does not use.
ckpt = torch.load("./vocos_checkpoint_epoch=464_step=1001610_val_loss=7.1732.ckpt", map_location="cpu")
model = Vocos.from_hparams("./config.yaml")
model.load_state_dict(ckpt["state_dict"], strict=False)
model.eval()

voice, sr = torchaudio.load("example.wav")  # must be sample_rate=32000
if sr != 32000:
    raise ValueError(f"Expected 32000 Hz audio, got {sr} Hz")

# Log-mel spectrogram with the same parameters the model was trained with
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=32000, n_fft=2048, hop_length=1024, n_mels=128, center=True, power=1,
)(voice)
mel = safe_log(mel)

with torch.no_grad():
    audio = model.decode(mel)

write("out.wav", 32000, audio.flatten().numpy())
```
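The snippet above assumes `example.wav` is already a 32 kHz file. If your audio has a different sample rate (or multiple channels you want to mix down), you can convert it before computing the mel spectrogram. A minimal sketch using torchaudio, where `example_44k.wav` is just a placeholder file name:

```python
import torchaudio

voice, sr = torchaudio.load("example_44k.wav")  # placeholder path, any sample rate

# Mix down to mono if the file has more than one channel
if voice.size(0) > 1:
    voice = voice.mean(dim=0, keepdim=True)

# Resample to the 32 kHz rate this checkpoint expects
if sr != 32000:
    voice = torchaudio.functional.resample(voice, orig_freq=sr, new_freq=32000)
    sr = 32000
```

The resulting `voice` tensor can then be fed to the `MelSpectrogram` transform above. With `hop_length=1024` at 32 kHz, each mel frame covers 32 ms of audio, so the decoded waveform is roughly `n_frames * 1024` samples long.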