fixie-ai/ultravox-v0_3


farzadab committed on Aug 13

Commit 498d19e • 1 Parent(s): d0247cf

Update README.md

Files changed (1)

README.md +11 -35


@@ -4,10 +4,8 @@ language:
 license: mit
 library_name: transformers
 datasets:
-- fnlp/AnyInstruct
-- fixie-ai/boolq-audio
-- fixie-ai/soda-audio
-- speechcolab/gigaspeech
 ---
 # Model Card for Ultravox
@@ -67,11 +65,12 @@ pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
 The model uses a pre-trained [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) backbone as well as the encoder part of [Whisper-small](https://huggingface.co/openai/whisper-small).
-The multi-modal adapter is first trained (while keeping backbones frozen) in stage 1 and then in stage 2. Llama3.1 is kept frozen.
 ### Training Data
-Training dataset is a mix of ASR datasets (Gigaspeech), instruction-following and QA data (AnyInstruct and an extended version of BoolQ), and conversational data (SODA with alternative generations for last two turns).
 ### Training Procedure
@@ -82,8 +81,7 @@ Supervised speech to audio finetuning. For more info, see [training code in Ultr
 #### Training Hyperparameters
 - **Training regime:** BF16 mixed precision training
-- **Hardware used:** 8x A100-40GB GPUs
-- **LLM LoRA Rank:** 64
 #### Speeds, Sizes, Times
@@ -93,30 +91,8 @@ Check out the audio tab on [TheFastest.ai](https://thefastest.ai/?m=audio) for d
 ## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary

 license: mit
 library_name: transformers
 datasets:
+- fixie-ai/librispeech_asr
+- fixie-ai/common_voice_17_0
 ---
 # Model Card for Ultravox
 The model uses a pre-trained [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) backbone as well as the encoder part of [Whisper-small](https://huggingface.co/openai/whisper-small).
+Only the multi-modal adapter is trained, while Whisper encoder and Llama are kept frozen.
+We use a knowledge-distillation loss where Ultravox is trying to match the logits of the text-based Llama backbone.
 ### Training Data
+Training dataset is a mix of ASR datasets, extended by adding a "continuation" generated by Llama 3.1 8B.
 ### Training Procedure
 #### Training Hyperparameters
 - **Training regime:** BF16 mixed precision training
+- **Hardware used:** 8x H100 GPUs
 #### Speeds, Sizes, Times
 ## Evaluation
+|                              | Ultravox v0.2 | Ultravox v0.3 | Whisper-Llama3.1 | Llama3.1 (text-only) |
+|------------------------------|---------------|---------------|------------------|----------------------|
+| en_de (BLEU)                 | 12.07         | 22.68         | 24.89            | 31.95                |
+| es_en (BLEU)                 | 15.17         | 24.10         | 28.67            | 38.28                |
+| LibriSpeech clean.test (WER) | 6.07          | 6.67          | 3.4              | -                    |
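
The added README lines say that Ultravox is trained with a knowledge-distillation loss to match the logits of the text-based Llama backbone. The sketch below illustrates one common form of such a loss: a per-token KL divergence between the teacher's and the student's output distributions, averaged over the sequence. The README does not state the exact formulation, so the choice of KL(teacher || student), the `temperature` parameter, and the names `kd_loss` and `log_softmax` are all assumptions made for illustration, not the project's actual training code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean per-token KL(teacher || student).

    Hypothetical helper: each argument is a list of rows, one row of
    vocabulary logits per token position. In the setting described above,
    the teacher rows would come from the text-only model and the student
    rows from the speech model, aligned position by position.
    """
    if not student_logits or len(student_logits) != len(teacher_logits):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax([x / temperature for x in s_row])
        t_logp = log_softmax([x / temperature for x in t_row])
        # KL(p_t || p_s) = sum_i p_t[i] * (log p_t[i] - log p_s[i])
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(student_logits)


if __name__ == "__main__":
    teacher = [[2.0, 0.0, -1.0], [0.1, 0.2, 0.3]]
    student = [[1.5, 0.2, -0.5], [0.0, 0.0, 0.0]]
    print(kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two diverge. With this kind of objective, only the student-side parameters that are being trained (here, the adapter) would receive gradients, while the teacher logits are treated as fixed targets.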