Update README.md
README.md
CHANGED
@@ -4,10 +4,8 @@ language:
|
|
4 |
license: mit
|
5 |
library_name: transformers
|
6 |
datasets:
|
7 |
-
-
|
8 |
-
- fixie-ai/
|
9 |
-
- fixie-ai/soda-audio
|
10 |
-
- speechcolab/gigaspeech
|
11 |
---
|
12 |
|
13 |
# Model Card for Ultravox
|
@@ -67,11 +65,12 @@ pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
|
|
67 |
|
68 |
The model uses a pre-trained [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) backbone as well as the encoder part of [Whisper-small](https://huggingface.co/openai/whisper-small).
|
69 |
|
70 |
-
|
|
|
71 |
|
72 |
### Training Data
|
73 |
|
74 |
-
Training dataset is a mix of ASR datasets
|
75 |
|
76 |
|
77 |
### Training Procedure
|
@@ -82,8 +81,7 @@ Supervised speech to audio finetuning. For more info, see [training code in Ultr
|
|
82 |
#### Training Hyperparameters
|
83 |
|
84 |
- **Training regime:** BF16 mixed precision training
|
85 |
-
- **Hardward used:** 8x
|
86 |
-
- **LLM LoRA Rank:** 64
|
87 |
|
88 |
#### Speeds, Sizes, Times
|
89 |
|
@@ -93,30 +91,8 @@ Check out the audio tab on [TheFastest.ai](https://thefastest.ai/?m=audio) for d
|
|
93 |
|
94 |
## Evaluation
|
95 |
|
96 |
-
|
97 |
-
|
98 |
-
|
99 |
-
|
100 |
-
|
101 |
-
|
102 |
-
<!-- This should link to a Dataset Card if possible. -->
|
103 |
-
|
104 |
-
[More Information Needed]
|
105 |
-
|
106 |
-
#### Factors
|
107 |
-
|
108 |
-
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
|
109 |
-
|
110 |
-
[More Information Needed]
|
111 |
-
|
112 |
-
#### Metrics
|
113 |
-
|
114 |
-
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
|
115 |
-
|
116 |
-
[More Information Needed]
|
117 |
-
|
118 |
-
### Results
|
119 |
-
|
120 |
-
[More Information Needed]
|
121 |
-
|
122 |
-
#### Summary
|
|
|
4 |
license: mit
|
5 |
library_name: transformers
|
6 |
datasets:
|
7 |
+
- fixie-ai/librispeech_asr
|
8 |
+
- fixie-ai/common_voice_17_0
|
|
|
|
|
9 |
---
|
10 |
|
11 |
# Model Card for Ultravox
|
|
|
65 |
|
66 |
The model uses a pre-trained [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) backbone as well as the encoder part of [Whisper-small](https://huggingface.co/openai/whisper-small).
|
67 |
|
68 |
+
Only the multi-modal adapter is trained, while Whisper encoder and Llama are kept frozen.
|
69 |
+
We use a knowledge-distillation loss where Ultravox is trying to match the logits of the text-based Llama backbone.
|
70 |
|
71 |
### Training Data
|
72 |
|
73 |
+
Training dataset is a mix of ASR datasets, extended by adding a "continuation" generated by Llama 3.1 8B.
|
74 |
|
75 |
|
76 |
### Training Procedure
|
|
|
81 |
#### Training Hyperparameters
|
82 |
|
83 |
- **Training regime:** BF16 mixed precision training
|
84 |
+
- **Hardward used:** 8x H100 GPUs
|
|
|
85 |
|
86 |
#### Speeds, Sizes, Times
|
87 |
|
|
|
91 |
|
92 |
## Evaluation
|
93 |
|
94 |
+
| | Ultravox v0.2 | Ultravox v0.3 | Whisper-Llama3.1 | Llama3.1 (text-only) |
|
95 |
+
|-------------------|---------------|---------------|------------------|----------------------|
|
96 |
+
| en_de (BLEU) | 12.07 | 22.68 | 24.89 | 31.95 |
|
97 |
+
| es_en (BLEU) | 15.17 | 24.10 | 28.67 | 38.28 |
|
98 |
+
| LibriSpeech clean.test (WER) | 6.07 | 6.67 | 3.4 | - |
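
The README lines added in this change state that Ultravox is trained with a knowledge-distillation loss that tries to match the logits of the text-based Llama backbone. The diff does not give the exact form of that loss, so the following is only a minimal sketch under stated assumptions: it uses the common KL-divergence formulation for a single token position, and the names `log_softmax`, `kd_loss`, and `temperature` are hypothetical, not taken from the Ultravox training code.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for one token position.

    Assumption: the teacher is the text-only Llama backbone and the
    student is the audio-conditioned model; the real loss may differ.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary size")
    log_p = log_softmax(teacher_logits, temperature)  # teacher distribution
    log_q = log_softmax(student_logits, temperature)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In an actual training loop this quantity would typically be averaged over token positions and computed with a tensor library rather than plain Python lists.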