VALL-E Korean Model

Introduction

The VALL-E Korean model is an implementation of the VALL-E architecture for the Korean language. It is a zero-shot text-to-speech synthesizer that generates natural-sounding Korean speech from text input. Text is phonemized with the espeak phonemizer using the language='ko' option, and audio is tokenized with the EnCodec audio tokenizer from Facebook Research's EnCodec repository.
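The phonemization step can be sketched as follows. This is a minimal illustration, not the model's actual preprocessing code: it assumes the `phonemizer` Python package with an espeak or espeak-ng installation available on the system, and the function name `phonemize_korean` is hypothetical. Only the language='ko' option comes from the description above.

```python
def phonemize_korean(text: str, language: str = "ko") -> str:
    """Convert Korean text to a phoneme string with the espeak backend.

    Hypothetical helper for illustration; the real preprocessing pipeline
    may use different separators or extra normalization.
    """
    # Imported lazily so this module still loads where phonemizer
    # (or the espeak binary it wraps) is not installed.
    from phonemizer import phonemize

    # strip=True drops the trailing separators that phonemizer appends.
    return phonemize(text, language=language, backend="espeak", strip=True)


# Example call (requires espeak or espeak-ng to be installed):
# print(phonemize_korean("안녕하세요"))
```

The resulting phoneme string is what a VALL-E style model would consume as its text-side input, while the prompt audio would be tokenized separately with EnCodec.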

Model Details

Architecture: The VALL-E Korean model consists of both ar (autoregressive) and nar (non-autoregressive) models.
Hidden Dimensions: The model has a hidden dimension of 1024.
Transformer Layers: It comprises 12 transformer layers.
Attention Heads: Each layer has 16 attention heads.
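To make these numbers concrete, the sketch below collects them in a small configuration object and derives the per-head dimension and a rough parameter count for one transformer stack. The class name `ValleKoConfig`, the feed-forward multiplier of 4, and the assumption that the ar and nar models share the same dimensions are illustrative guesses, not details confirmed by this model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValleKoConfig:
    """Hypothetical container for the hyperparameters listed above."""

    hidden_dim: int = 1024  # hidden dimension
    num_layers: int = 12    # transformer layers
    num_heads: int = 16     # attention heads per layer

    @property
    def head_dim(self) -> int:
        # Each head attends over an equal slice of the hidden dimension.
        return self.hidden_dim // self.num_heads

    def approx_transformer_params(self, ffn_mult: int = 4) -> int:
        """Rough weight count for one transformer stack.

        Counts only the attention projections (Q, K, V, output) and the
        two feed-forward matrices. Biases, layer norms, and embeddings
        are ignored. ffn_mult=4 is an assumed feed-forward expansion.
        """
        d = self.hidden_dim
        attention = 4 * d * d
        feed_forward = 2 * ffn_mult * d * d
        return self.num_layers * (attention + feed_forward)


cfg = ValleKoConfig()
print(cfg.head_dim)                     # 64
print(cfg.approx_transformer_params())  # 150994944
```

Under these assumptions each stack holds roughly 151 million transformer weights, so the ar and nar models together would be about twice that, before embeddings and output layers.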

Training Data

The VALL-E Korean model was trained on approximately 2000 hours of Korean audio-text pairs, sourced from the AI-Hub dataset 한국인 대화음성 (Korean Conversational Speech).

Example Usage

For an example of how to use the VALL-E Korean model, see the provided Google Colab notebook, tester_colab.ipynb, which demonstrates text-to-speech synthesis with the model. The example also uses the vocos decoder from Plachtaa's VALL-E repository.

References

For more information and details on using the model, please refer to the provided references and resources.

Updated

We also trained the model on the 8k dataset from AI Hub and uploaded the result as v1. This version performs better when the audio source is clean (e.g., voice-source), but it may not work well when the audio source is of poor quality. Therefore, both v0 and v1 are maintained.