VALL-E Korean Model
Introduction
The VALL-E Korean model is an implementation of the VALL-E architecture for the Korean language. It is a zero-shot text-to-speech synthesizer that generates natural-sounding Korean speech from text input. Its components include the espeak text phonemizer, run with the language='ko' option, and the EnCodec audio tokenizer from Facebook Research's EnCodec repository.
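As a rough illustration of the phonemization step, the sketch below calls the `phonemizer` Python package with the espeak backend and `language='ko'`. It assumes that `phonemizer` and the espeak-ng binary are installed; the helper name `phonemize_korean` and the `strip=True` setting are illustrative choices, not taken from this model's training code.

```python
def phonemize_korean(text: str) -> str:
    """Convert Korean text to a phoneme string using espeak with language='ko'."""
    # Imported lazily so this helper can be defined even when the optional
    # `phonemizer` package (which wraps the espeak-ng binary) is missing.
    from phonemizer import phonemize

    return phonemize(
        text,
        language="ko",     # the language option named in this README
        backend="espeak",  # espeak / espeak-ng backend
        strip=True,        # assumption: drop trailing separators from the output
    )
```

Calling `phonemize_korean("안녕하세요")` returns the phoneme string for that sentence. The exact symbols depend on the installed espeak-ng version, so no expected output is shown here.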
Model Details
- Architecture: The VALL-E Korean model consists of two models, an autoregressive (ar) model and a non-autoregressive (nar) model.
- Hidden Dimensions: The model has a hidden dimension of 1024.
- Transformer Layers: It comprises 12 transformer layers.
- Attention Heads: Each layer has 16 attention heads.
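To make these hyperparameters concrete, here is a minimal sketch that records them in a config object and derives the per-head dimension and a rough transformer parameter count. The class name `ValleKoConfig` is hypothetical, and the feed-forward width (4x the hidden dimension) is an assumption rather than a documented setting. The estimate ignores biases, layer norms, and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValleKoConfig:
    """Hypothetical container for the hyperparameters listed above."""

    hidden_dim: int = 1024   # from the model details
    num_layers: int = 12     # from the model details
    num_heads: int = 16      # from the model details
    ffn_multiplier: int = 4  # ASSUMPTION: feed-forward width = 4 * hidden_dim

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden dimension.
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads

    def approx_transformer_params(self) -> int:
        """Rough weight count for the transformer stack of one model."""
        d = self.hidden_dim
        attention = 4 * d * d                         # Q, K, V and output projections
        feed_forward = 2 * self.ffn_multiplier * d * d  # two linear layers
        return self.num_layers * (attention + feed_forward)


config = ValleKoConfig()
print(config.head_dim)                     # 64
print(config.approx_transformer_params())  # 150994944
```

Under these assumptions each head has 64 dimensions, and one transformer stack holds roughly 151M weights.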
Training Data
The training data for the VALL-E Korean model consists of approximately 2000 hours of Korean audio-text pairs. This dataset was sourced from AI-Hub 한국인 대화음성 (Korean Conversational Speech).
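For a sense of scale, the following back-of-the-envelope sketch converts 2000 hours of audio into EnCodec token counts. It assumes the 24 kHz EnCodec model (75 frames per second) with 8 codebooks, the 6 kbps setting used in the VALL-E paper. This README does not state which EnCodec configuration was used, so treat the numbers as illustrative.

```python
SECONDS_PER_HOUR = 3600


def codec_frames(hours: float, frames_per_second: int = 75) -> int:
    """Number of codec frames for the given audio duration.

    ASSUMPTION: 75 frames/s, the frame rate of the 24 kHz EnCodec model.
    """
    return int(hours * SECONDS_PER_HOUR * frames_per_second)


def codec_tokens(hours: float, num_codebooks: int = 8,
                 frames_per_second: int = 75) -> int:
    """Total discrete tokens: one token per codebook per frame.

    ASSUMPTION: 8 codebooks (the 6 kbps EnCodec setting).
    """
    return codec_frames(hours, frames_per_second) * num_codebooks


print(codec_frames(2000))  # 540000000
print(codec_tokens(2000))  # 4320000000
```

Under these assumptions the corpus corresponds to about 540 million codec frames, or about 4.3 billion tokens across all codebooks.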
Example Usage
For an example of how to use the VALL-E Korean model, see the provided Google Colab notebook, tester_colab.ipynb, which demonstrates text-to-speech synthesis with the model. The example also uses the vocos decoder from Plachtaa's VALL-E-X repository.
References
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
- VALL-E Repository by lifeiteng
- VALL-E Repository by Enhuiz
- VALL-E-X Repository by Plachtaa
- Vocos
For more information and details on using the model, please refer to the provided references and resources.
Updated
We also trained the model on an 8k dataset from AI Hub; the result is uploaded as v1. The v1 model performs better when the audio source is clean (e.g., voice-source), but it may not work well when the audio source is of poor quality. Both v0 and v1 are therefore maintained.