---
license: mit
datasets:
- ShoukanLabs/AniSpeech
- vctk
- blabble-io/libritts_r
language:
- en
pipeline_tag: text-to-speech
---

<style>

  .TitleContainer {
    background-color: #ffff;
    margin-bottom: 0rem;
    margin-left: auto;
    margin-right: auto;
    width: 40%;
    height: 30%;
    border-radius: 10rem;
    border: 0.5vw solid #ff593e;
    transition: .6s;
  }

  .TitleContainer:hover {
    transform: scale(1.05);
  }

  .VokanLogo {
    margin: auto;
    display: block;
  }

  audio {
	margin: 0.5rem;    
  }
    
  .audio-container {
    display: flex;
    justify-content: center;
    align-items: center;
  }

</style>

<hr>

<div class="TitleContainer" align="center">
      <!--<img src="https://huggingface.co/ShoukanLabs/Vokan/resolve/main/Vokan.gif" class="VokanLogo">-->
      <img src="Vokan.gif" class="VokanLogo">
</div>

<p align="center" style="font-size: 1vw; font-weight: bold; color: #ff593e;">A StyleTTS2 fine-tune, designed for expressiveness.</p>

<hr>

<div class='audio-container'>
  <a align="center" href="https://discord.gg/5bq9HqVhsJ"><img src="https://img.shields.io/badge/find_us_at_the-ShoukanLabs_Discord-invite?style=flat-square&logo=discord&logoColor=%23ffffff&labelColor=%235865F2&color=%23ffffff" width="320" alt="discord"></a>
  <!--<a align="left" style="font-size: 1.3rem; font-weight: bold; color: #5662f6;" href="https://discord.gg/5bq9HqVhsJ">find us on Discord</a>-->
</div>

**Vokan** is a fine-tuned **StyleTTS2** model built for authentic, expressive zero-shot speech synthesis. It is also designed to serve as a better
base model for further fine-tuning.
It was trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, which cover a variety of accents and contexts.
With more than 6 days of audio from 672 diverse and expressive speakers,
Vokan captures a wide range of vocal characteristics.
Although the amount of training data is less than the original, the broad array of accents and speakers enriches the model's vector space.
Training took 300 hours on 1x H100 and an additional 600 hours on 1x 3090.

You can read more about it in our article on [DagsHub!](https://dagshub.com/blog/styletts2/)
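
To put the figures quoted above in perspective, here is a rough back-of-the-envelope calculation. It only uses the numbers stated in this card (6 days of audio as a lower bound, 672 speakers, 300 and 600 GPU hours); the variable names are ours, and the per-speaker average is an estimate, not a measured statistic of the dataset.

```python
# Figures taken from the model card; "6+ days" is treated as a lower bound.
AUDIO_DAYS = 6
SPEAKERS = 672
H100_HOURS = 300
RTX3090_HOURS = 600

# Total audio in minutes and the resulting average per speaker.
audio_minutes = AUDIO_DAYS * 24 * 60
minutes_per_speaker = audio_minutes / SPEAKERS

# Combined wall-clock training time across both single-GPU runs.
total_gpu_hours = H100_HOURS + RTX3090_HOURS

print(f"At least {audio_minutes} minutes of audio in total")
print(f"At least {minutes_per_speaker:.1f} minutes per speaker on average")
print(f"{total_gpu_hours} GPU hours ({total_gpu_hours / 24:.1f} days) of training")
```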


<hr>
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Vokan Samples!</p>
<div class='audio-container'>
  <div>
      <audio controls>
        <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%201.wav" type="audio/wav">
      Your browser does not support the audio element.
      </audio>
  </div>
  
  <div>
      <audio controls>
        <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%202.wav" type="audio/wav">
      Your browser does not support the audio element.
      </audio>
  </div>
</div>
<div class='audio-container'>
  <div>
      <audio controls>
        <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%203.wav" type="audio/wav">
      Your browser does not support the audio element.
      </audio>
  </div>
  <div>
      <audio controls>
        <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%204.wav" type="audio/wav">
      Your browser does not support the audio element.
      </audio>
  </div>
</div>
<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Acknowledgements!</p>

- **[DagsHub](https://dagshub.com):** Special thanks to DagsHub for sponsoring GPU compute resources as well as offering an amazing versioning service, enabling efficient model training and development. A shoutout to Dean in particular!
- **[camenduru](https://github.com/camenduru):** Thanks to camenduru for their expertise in cloud infrastructure and model training, which played a crucial role in the development of Vokan! Please give them a follow!

<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Conclusion!</p>

V2 is currently in the works, aiming to be bigger and better in every way, including multilingual support!
This is where you come in: if you have any large single-speaker datasets you'd like to contribute,
in any language, you can add them to our **Vokan dataset**, a large **community dataset** that combines many
smaller single-speaker datasets into one big multi-speaker one.
You can upload your Uberduck or [FakeYou](https://fakeyou.com/) compliant datasets via the
**[Vokan](https://huggingface.co/ShoukanLabs/Vokan)** bot on the **[ShoukanLabs Discord Server](https://discord.gg/hdVeretude)**.
The more data we have, the better the models we produce will be!
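
As an illustration of what combining single-speaker datasets into a multi-speaker one can look like, here is a minimal sketch. It is not the actual Vokan pipeline: the function name `merge_filelists`, the two-column `wav_path|transcript` input layout, and the three-column `path|transcript|speaker_id` output layout are all assumptions made for this example.

```python
def merge_filelists(filelists):
    """Merge per-speaker file lists into one multi-speaker list.

    `filelists` maps a speaker name to lines of the (assumed) form
    "wav_path|transcript". Returns the merged lines in the (assumed) form
    "speaker/wav_path|transcript|speaker_id" plus the name -> id mapping.
    """
    speaker_ids = {}
    merged = []
    # Sort speaker names so the assigned integer ids are reproducible.
    for name in sorted(filelists):
        speaker_id = speaker_ids.setdefault(name, len(speaker_ids))
        for raw in filelists[name]:
            parts = raw.strip().split("|")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            wav_path, transcript = parts[0], parts[1]
            # Prefix the path with the speaker name to avoid file name clashes.
            merged.append(f"{name}/{wav_path}|{transcript}|{speaker_id}")
    return merged, speaker_ids


# Hypothetical input: two small single-speaker datasets.
example = {
    "bob": ["wavs/1.wav|Hi there."],
    "alice": ["wavs/1.wav|Hello.", "wavs/2.wav|How are you?"],
}
lines, ids = merge_filelists(example)
print("\n".join(lines))
print(ids)
```

Prefixing each path with the speaker name keeps files such as `wavs/1.wav` from colliding once the datasets share a single root. Any text normalization or phonemization a training script may require is deliberately left out of this sketch.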
<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Citations!</p>

```citations
@misc{li2023styletts,
      title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
      author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
      year={2023},
      eprint={2306.07691},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

@misc{zen2019libritts,
      title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
      author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
      year={2019},
      eprint={1904.02882},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
The Centre for Speech Technology Research (CSTR),
University of Edinburgh
```

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">License!</p>

```
MIT
```