gpt-omni/mini-omni

ylacombe (HF staff) committed on Sep 4

Commit 2706f21 • 1 Parent(s): b3a6952

Update README.md

Files changed (1)

README.md +67 -1

@@ -3,6 +3,9 @@ license: mit
 language:
 - en
 base_model: Qwen/Qwen2-0.5B
+tags:
+- text-to-speech
+- speech-to-speech
 ---
@@ -33,4 +36,67 @@ Mini-Omni is an open-source multimodel large language model that can **hear, tal
 ✅ With "Audio-to-Text" and "Audio-to-Audio" **batch inference** to further boost the performance.
-**NOTE**: please refer to https://github.com/gpt-omni/mini-omni for more details.
+**NOTE**: please refer to the [code repository](https://github.com/gpt-omni/mini-omni) for more details.
+## Install
+Create a new conda environment and install the required packages:
+```sh
+conda create -n omni python=3.10
+conda activate omni
+git clone https://github.com/gpt-omni/mini-omni.git
+cd mini-omni
+pip install -r requirements.txt
+```
+## Quick start
+**Interactive demo**
+- start server
+```sh
+conda activate omni
+cd mini-omni
+python3 server.py --ip '0.0.0.0' --port 60808
+```
+- run streamlit demo
+NOTE: you need to run streamlit locally with PyAudio installed.
+```sh
+pip install PyAudio==0.2.14
+API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
+```
+- run gradio demo
+```sh
+API_URL=http://0.0.0.0:60808/chat python3 webui/omni_gradio.py
+```
+example:
+NOTE: you need to unmute the audio first. Gradio does not seem to play the audio stream instantly, so the latency feels a bit longer.
+https://github.com/user-attachments/assets/29187680-4c42-47ff-b352-f0ea333496d9
+**Local test**
+```sh
+conda activate omni
+cd mini-omni
+# run a test on the preset audio samples and questions
+python inference.py
+```
+## Acknowledgements
+- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.
+- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.
+- [whisper](https://github.com/openai/whisper/) for audio encoding.
+- [snac](https://github.com/hubertsiuzdak/snac/) for audio decoding.
+- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.
+- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.
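
The Quick start commands in this diff bind the server to `0.0.0.0` and then reuse that address in `API_URL` for the web UIs. `0.0.0.0` is a bind-all address, and connecting to it as a client may not work on every platform. The sketch below is one possible way to build the client URL with the loopback address instead. The helper name `omni_api_url` is hypothetical and not part of mini-omni; only the port `60808` and the `/chat` path come from the README.

```sh
# Hypothetical helper (not part of mini-omni): build the client-side API URL.
# Arguments: host (default 0.0.0.0) and port (default 60808, as in the README).
omni_api_url() {
    host="${1:-0.0.0.0}"
    port="${2:-60808}"
    # 0.0.0.0 means "listen on all interfaces" on the server side;
    # a client on the same machine can use the loopback address instead.
    if [ "$host" = "0.0.0.0" ]; then
        host="127.0.0.1"
    fi
    printf 'http://%s:%s/chat\n' "$host" "$port"
}
```

Assuming the server from the Quick start is already running, the helper could be used as `API_URL="$(omni_api_url)" python3 webui/omni_gradio.py`. For a server on another machine, pass that machine's address as the first argument.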