Text-to-Speech
Text-to-Speech (TTS) is the task of generating natural-sounding speech from text input. A single TTS model can also be extended to generate speech for multiple speakers and multiple languages.
About Text-to-Speech
Use Cases
Text-to-Speech (TTS) models can be used in any speech-enabled application that needs to convert text into speech that imitates the human voice.
Voice Assistants
TTS models are used to create voice assistants on smart devices. They are a better alternative to concatenative methods, in which the assistant is built by recording sounds and mapping them, because the output of a TTS model contains elements of natural speech such as emphasis.
Announcement Systems
TTS models are widely used in airport and public transportation announcement systems to convert the text of an announcement into speech.
Inference Endpoints
The Hub contains over 1500 TTS models that you can use right away by trying out the widgets directly in the browser or calling the models as a service using Inference Endpoints. Here is a simple code snippet to get you started:
import requests

API_TOKEN = "hf_..."  # placeholder: replace with your own Hugging Face access token
API_URL = "https://api-inference.huggingface.co/models/microsoft/speecht5_tts"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    # Send the text to the hosted model and return the raw HTTP response.
    response = requests.post(API_URL, headers=headers, json=payload)
    return response

output = query({"inputs": "Max is the best doggo."})
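The snippet above stops at the HTTP response. As a minimal sketch of one way to keep the result, the helper below writes the response body to disk, assuming a successful response carries raw audio bytes and a Content-Type header naming the format. The names save_audio_response and EXTENSIONS are hypothetical, and the list of content types is an assumption, not a documented guarantee of the API.

```python
from pathlib import Path

# Assumed mapping from Content-Type to file extension (illustrative only).
EXTENSIONS = {
    "audio/flac": ".flac",
    "audio/wav": ".wav",
    "audio/x-wav": ".wav",
    "audio/mpeg": ".mp3",
}


def save_audio_response(response, stem="speech"):
    """Write the body of a response-like object to `stem` plus an extension.

    `response` only needs `.status_code`, `.headers` and `.content`,
    which is what a `requests.Response` provides.
    """
    if response.status_code != 200:
        raise RuntimeError(f"Request failed with status {response.status_code}")
    # Drop any parameters such as "; charset=..." from the header value.
    content_type = response.headers.get("Content-Type", "").split(";")[0].strip()
    # Unknown types fall back to a neutral ".bin" extension.
    path = Path(stem).with_suffix(EXTENSIONS.get(content_type, ".bin"))
    path.write_bytes(response.content)
    return path
```

With the response returned by query, calling save_audio_response(output) would write a file such as speech.flac in the current directory; a non-200 status raises an error instead of writing a file.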
You can also use libraries such as espnet or transformers if you want to handle the inference directly.
Direct Inference
You can also use the text-to-speech pipeline in Transformers to synthesize high-quality speech.
from transformers import pipeline
synthesizer = pipeline("text-to-speech", "suno/bark")
synthesizer("Look I am generating speech in three lines of code!")
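The pipeline call returns a dictionary with the waveform under "audio" and its rate under "sampling_rate". As a rough sketch of how that result could be stored, the helper below writes such a dictionary to a 16-bit mono WAV file using only the standard library. The name save_speech_as_wav is hypothetical, and the code assumes float samples in the range [-1, 1] and a single audio channel.

```python
import array
import sys
import wave


def save_speech_as_wav(speech, path):
    """Write a {"audio", "sampling_rate"} dict to a 16-bit mono WAV file.

    Returns the number of samples written.
    """
    audio = speech["audio"]
    # NumPy arrays (possibly shaped (1, n)) are flattened to a plain list.
    if hasattr(audio, "ravel"):
        audio = audio.ravel().tolist()
    # Clip each sample to [-1, 1] and scale it to a signed 16-bit integer.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in audio))
    # WAV data is little-endian, so swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(int(speech["sampling_rate"]))
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

For example, if the pipeline result is stored in a variable named speech, save_speech_as_wav(speech, "speech.wav") would produce a file that ordinary audio players can open.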
You can use huggingface.js to run inference with text-to-speech models on the Hugging Face Hub.
import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_TOKEN);
await inference.textToSpeech({
  model: "facebook/mms-tts",
  inputs: "text to generate speech from",
});
Useful Resources
Compatible libraries
Note A prompt based, powerful TTS model.
Note A powerful TTS model that supports English and Chinese.
Note A massively multilingual TTS model.
Note A powerful TTS model.
Note A Llama based TTS model.
Note 10K hours of multi-speaker English speech data.
Note Multi-speaker English dataset.
Note Multilingual dataset.
Note An application that generates highly realistic, multilingual speech.
Note An application on XTTS, a voice generation model that lets you clone voices into different languages.
Note An application that generates speech in different styles in English and Chinese.
Note An application that synthesizes emotional speech for diverse speaker prompts.
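- Mel cepstral distortion
- The Mel Cepstral Distortion (MCD) metric measures the quality of generated speech by comparing its mel-cepstral features with those of a reference recording; lower values indicate a closer match.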
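To make the metric concrete, here is a minimal sketch of one common formulation of MCD: for each frame, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames and reported in dB. The function name mel_cepstral_distortion is hypothetical. The sketch assumes the mel-cepstral coefficients were extracted elsewhere, that the two sequences are already time-aligned (real evaluations often align them first, for example with dynamic time warping), and that coefficient 0 is excluded.

```python
import math


def mel_cepstral_distortion(reference, synthesized):
    """Mean MCD in dB over two equally long, time-aligned sequences of frames.

    Each frame is a sequence of mel-cepstral coefficients.
    """
    if len(reference) == 0 or len(reference) != len(synthesized):
        raise ValueError("Expected two non-empty sequences with the same number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        # Skip coefficient 0, which mostly reflects overall energy.
        squared = sum((r - s) ** 2 for r, s in zip(ref_frame[1:], syn_frame[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference)
```

Identical inputs give a distortion of 0, and the value grows as the generated coefficients drift away from the reference.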