---
datasets:
- mythicinfinity/libritts_r
- MikhailT/hifi-tts
- speechcolab/gigaspeech
language:
- en
base_model:
- Qwen/Qwen2-0.5B
pipeline_tag: text-to-audio
---
# <center>Viitor-Voice</center>
### <center>An LLM-based TTS Engine</center>

<p align="center">
  <img src="./post.webp" alt="Viitor-Voice Cover">
</p>

## Features

- **Lightweight Design**  

  The model is simple, efficient, and compatible with most LLM inference engines. With only 0.5B parameters, it keeps compute and memory requirements low while maintaining high performance, so it can be deployed not only on servers but also on mobile devices and in edge computing environments, meeting diverse deployment needs.

- **Real-time Streaming Output, Low Latency Experience**  

  The model supports real-time speech generation, suitable for applications that demand low latency. On a Tesla T4, it achieves a first-frame latency of roughly 200 ms, making responses feel nearly instantaneous in interactive applications that require quick feedback.

- **Rich Voice Library**  

  Offers more than 300 voice options, so you can choose the speech style best suited to your content and audience, from formal business presentations to casual entertainment.

- **Flexible Speech Rate Adjustment**  

  The model supports natural variations in speech rate, allowing users to easily adjust it based on content requirements and audience preferences. Whether speeding up for efficient information delivery or slowing down to enhance emotional depth, it maintains natural speech fluency.

- **Zero-shot Voice Cloning (Under Research)**  

  The decoder-only architecture lends itself naturally to zero-shot cloning; support for rapid voice cloning from short reference samples is planned.

---

## Output Samples

Below are examples of speech generated by this project:

- Example 1: [Female Voice - 1](./female_normal.wav)
- Example 2: [Male Voice - 1](./male_normal.wav)

---

## Environment Setup

```commandline
git clone https://github.com/MrWaterZhou/viitor-voice.git
cd viitor-voice
conda create -n viitor_voice python=3.10
conda activate viitor_voice
pip install -r requirements.txt

### Due to an issue with vLLM's tokenizer length calculation, the token limit does not take effect; apply this patch:
python_package_path=`pip show pip | egrep Location | awk -F ' ' '{print $2}'`
cp viitor_voice/utils/patch.py $python_package_path/vllm/entrypoints/openai/logits_processors.py
```
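If you prefer locating the site-packages directory from Python rather than parsing `pip show` output, something like the following should work. This is a sketch: the destination path assumes the same vLLM layout targeted by the patch command above.

```python
import os
import sysconfig

# Locate the active environment's site-packages directory.
site_packages = sysconfig.get_paths()["purelib"]

# Destination of the patched logits processor inside vllm.
target = os.path.join(
    site_packages, "vllm", "entrypoints", "openai", "logits_processors.py"
)
print(target)
```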

---

## Inference

### Offline Inference

```python
from viitor_voice.utils.offline_inference import OfflineInference
import torchaudio

tts_engine = OfflineInference(model_path='ZzWater/viitor-voice-en',
                              config_path='viitor_voice/inference_configs/en.json')
text_list = [
    "Isn't it fascinating to think about the endless possibilities that lie within the pages of a book. every time you turn a page, you're diving into a new world ripe with potential for discovery, and wonder what stories will you uncover today."]
# List valid speaker ids
print(tts_engine.prompt_map.keys())
# speaker selects a voice from prompt_map; speed controls the speech rate
audios = tts_engine.batch_infer(text_list=text_list, speaker=['1'], speed=2)
torchaudio.save('test.wav', audios[0], 24000)  # the model outputs 24 kHz audio
```
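For long inputs it can help to split the text into sentence-sized chunks and pass the resulting list to `batch_infer`. A minimal sketch follows; the `split_sentences` helper and the 200-character limit are illustrative, not part of the package.

```python
import re


def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries, packing sentences into
    chunks of at most max_chars characters."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # +1 accounts for the joining space.
        if len(current) + len(sentence) + 1 <= max_chars:
            current = (current + " " + sentence).strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

The chunks can then be synthesized in one call, e.g. `tts_engine.batch_infer(text_list=split_sentences(long_text), speaker=['1'], speed=2)`, and the resulting audio tensors concatenated before saving.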

### Streaming Inference (TODO)

---
## Training (TODO)

## References

- [SNAC](https://github.com/hubertsiuzdak/snac)
- [mini-omni](https://github.com/gpt-omni/mini-omni)
- [open-gpt-4-o](https://laion.ai/notes/open-gpt-4-o/)

## License

This project is licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).  
You are free to share and modify the code of this project for non-commercial purposes, under the following conditions:

1. **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
2. **Non-Commercial**: You may not use the material for commercial purposes.

**Copyright Notice:**  
© 2024 Livedata. All Rights Reserved.