Platform | Link |
---|---|
🌎 Live Demo | indrivoice.ai |
𝕏 Twitter | @11mlabs |
🐱 GitHub | Indri Repository |
🤗 Hugging Face (Collection) | Indri collection |
📝 Release Blog | Release Blog |
Indri is a series of audio models that can do TTS, ASR, and audio continuation. This is the medium-sized model (350M) in our series; it supports TTS in two languages, English and Hindi.
indri-0.1-350m-tts is a small, lightweight TTS model based on the transformer architecture. It models audio as tokens and can generate high-quality audio while consistently cloning the speaker's style.
Text | Sample |
---|---|
अतीत गौरवशाली, वर्तमान आशावादी, भविष्य उज्जवल | |
भाइयों और बहनों, ये हमारा सौभाग्य है कि हम सब मिलकर इस महान देश को नई ऊंचाइयों पर ले जाने का सपना देख रहे हैं। | |
Hello दोस्तों, future of speech technology mein अपका स्वागत है | |
Artificial Intelligence's collaborative hub: Transforming Machine Learning together | |
Intelligent machines processing data at lightning-fast electronic speeds | |
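Here is a brief overview of how the model works: it represents audio as discrete tokens and generates them with a transformer built on the GPT-2 base model.
Please read our release blog for more technical details on how it was built.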
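As a rough illustration of what modelling audio as tokens implies at generation time, here is a minimal sketch of a greedy autoregressive loop, assuming GPT-2 style decoding. Everything in it is hypothetical: the `generate` helper, the `toy_next_token` stub, and the `EOS` id stand in for the real model and are not Indri's actual code.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 16,
) -> List[int]:
    """Greedy autoregressive decoding.

    Each new token is predicted from the full sequence so far, appended,
    and fed back in, until EOS is produced or the token budget runs out.
    """
    tokens = list(prompt)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
        generated.append(tok)
    return generated


def toy_next_token(tokens: List[int]) -> int:
    # Stand-in for a transformer forward pass: counts down from the last token.
    return max(tokens[-1] - 1, EOS)


# In a real system the prompt would be text tokens and the result audio tokens,
# which a separate audio decoder would turn back into a waveform.
audio_tokens = generate([5], toy_next_token)
print(audio_tokens)
```

In the real model the stub would be replaced by a forward pass of the transformer.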
Use the code below to get started with the model. The Transformers pipeline is the simplest way to run it.
import torch
import torchaudio
from transformers import pipeline
model_id = '11mlabs/indri-0.1-350m-tts'
task = 'indri-tts'
pipe = pipeline(
    task,
    model=model_id,
    device=torch.device('cuda:0'),  # Update this based on your hardware
    trust_remote_code=True,
)
output = pipe(['Hi, my name is Indri and I like to talk.'])
torchaudio.save('output.wav', output[0]['audio'][0], sample_rate=24000)
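The example above hard-codes `cuda:0`. If you want the script to adapt to the machine it runs on, a small helper like the one below can pick a device string. This is only a sketch: `pick_device` is a hypothetical helper, and whether the model runs acceptably on CPU or Apple MPS is an assumption that is not confirmed here.

```python
def pick_device(cuda_available: bool, mps_available: bool = False) -> str:
    """Return a device string, preferring CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda:0"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    # torch.backends.mps is absent on older PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    device = pick_device(
        torch.cuda.is_available(),
        bool(mps and mps.is_available()),
    )
except ImportError:
    # PyTorch is not installed; fall back to the CPU string.
    device = pick_device(False)

print(device)
```

The resulting string can then be passed as `device=torch.device(device)` when building the pipeline.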
To run inference from the GitHub repository instead, clone it and install the dependencies:
git clone https://github.com/cmeraki/indri.git
cd indri
pip install -r requirements.txt
# Install ffmpeg (for Mac/Windows, refer here: https://www.ffmpeg.org/download.html)
sudo apt update -y
sudo apt upgrade -y
sudo apt install ffmpeg -y
python -m inference --model_path 11mlabs/indri-0.1-350m-tts --device cuda:0 --port 8000
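Since the steps above depend on ffmpeg being installed, it can help to confirm that it is on your `PATH` before running the inference command. The check below is a minimal sketch using only the Python standard library; `ffmpeg_available` is a hypothetical helper, not part of the Indri repository.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary) is not None


if ffmpeg_available():
    print("ffmpeg found at", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found on PATH; install it before running inference")
```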
If you use this model in your research, please cite:
@misc{indri-multimodal-alm,
author = {11mlabs},
title = {Indri: Multimodal audio language model},
year = {2024},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/cmeraki/indri}},
}
@techreport{kyutai2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and
Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
year={2024},
eprint={2410.00037},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2410.00037},
}
@misc{radford2022whisper,
doi = {10.48550/ARXIV.2212.04356},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
@misc{silero-vad,
author = {Silero Team},
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad}},
commit = {insert_some_commit_here},
}
Base model: openai-community/gpt2