Tags: Feature Extraction · Transformers · Safetensors · English · ultravox · custom_code
farzadab committed
Commit 498d19e
1 Parent(s): d0247cf

Update README.md

Files changed (1)
  1. README.md +11 -35
README.md CHANGED
@@ -4,10 +4,8 @@ language:
  license: mit
  library_name: transformers
  datasets:
- - fnlp/AnyInstruct
- - fixie-ai/boolq-audio
- - fixie-ai/soda-audio
- - speechcolab/gigaspeech
+ - fixie-ai/librispeech_asr
+ - fixie-ai/common_voice_17_0
  ---

  # Model Card for Ultravox
@@ -67,11 +65,12 @@ pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)

  The model uses a pre-trained [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) backbone as well as the encoder part of [Whisper-small](https://huggingface.co/openai/whisper-small).

- The multi-modal adapter is first trained (while keeping backbones frozen) in stage 1 and then in stage 2. Llama3.1 is kept frozen.
+ Only the multi-modal adapter is trained, while the Whisper encoder and Llama are kept frozen.
+ We use a knowledge-distillation loss, where Ultravox tries to match the logits of the text-based Llama backbone.

  ### Training Data

- Training dataset is a mix of ASR datasets (Gigaspeech), instruction-following and QA data (AnyInstruct and an extended version of BoolQ), and conversational data (SODA with alternative generations for last two turns).
+ The training dataset is a mix of ASR datasets, extended by adding a "continuation" generated by Llama 3.1 8B.

  ### Training Procedure
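
Editor's note: the adapter-only, distillation-style objective added in this hunk can be illustrated with a short sketch. This is not the actual Ultravox training code (see the training code linked in the card for that); the function names, the temperature parameter, and the assumption that the speech model's output positions are already aligned with the text teacher's positions are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def freeze_backbones(whisper_encoder: torch.nn.Module, llama: torch.nn.Module) -> None:
    # Keep both pre-trained backbones fixed; only the adapter's parameters stay trainable.
    for module in (whisper_encoder, llama):
        for p in module.parameters():
            p.requires_grad = False

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    # KL(teacher || student) on next-token distributions: the speech model (student)
    # is pushed toward the frozen text-only Llama (teacher). Assumes the two logit
    # tensors are position-aligned, which is a simplification for this sketch.
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_p = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * temperature ** 2

# Smoke test with random logits: batch of 2, sequence of 5, vocabulary of 11.
student = torch.randn(2, 5, 11, requires_grad=True)
teacher = torch.randn(2, 5, 11)
loss = kd_loss(student, teacher)
loss.backward()
print(loss.item())
```

Because the teacher sees the transcript while the student sees audio, gradients only flow into the adapter that maps Whisper features into the Llama embedding space.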
@@ -82,8 +81,7 @@ Supervised speech to audio finetuning. For more info, see [training code in Ultr
  #### Training Hyperparameters

  - **Training regime:** BF16 mixed precision training
- - **Hardware used:** 8x A100-40GB GPUs
- - **LLM LoRA Rank:** 64
+ - **Hardware used:** 8x H100 GPUs

  #### Speeds, Sizes, Times
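
Editor's note: for readers unfamiliar with the "BF16 mixed precision" regime listed above, the generic sketch below shows the idea with `torch.autocast`. It is not taken from the Ultravox training setup; the toy model, optimizer, and shapes are placeholders.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 512).cuda()          # stand-in for the trainable adapter
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

x = torch.randn(8, 512, device="cuda")
target = torch.randn(8, 512, device="cuda")

# Forward pass runs in bfloat16; parameters and optimizer state stay in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = F.mse_loss(model(x), target)

# Unlike FP16, BF16 keeps FP32's exponent range, so no GradScaler is required.
loss.backward()
optimizer.step()
optimizer.zero_grad()
```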
@@ -93,30 +91,8 @@ Check out the audio tab on [TheFastest.ai](https://thefastest.ai/?m=audio) for d

  ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
+ |                              | Ultravox v0.2 | Ultravox v0.3 | Whisper-Llama3.1 | Llama3.1 (text-only) |
+ |------------------------------|---------------|---------------|------------------|----------------------|
+ | en_de (BLEU)                 | 12.07         | 22.68         | 24.89            | 31.95                |
+ | es_en (BLEU)                 | 15.17         | 24.10         | 28.67            | 38.28                |
+ | LibriSpeech clean.test (WER) | 6.07          | 6.67          | 3.4              | -                    |
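
Editor's note: the table added above reports BLEU for the speech-translation pairs and WER for LibriSpeech test-clean. The commit does not state the evaluation toolchain; the sketch below shows one common way such scores could be computed (sacrebleu and jiwer), with made-up example strings, and may differ from the preprocessing used for the reported numbers.

```python
import sacrebleu
import jiwer

# Hypothetical model output and reference for an en_de-style translation segment.
hypotheses = ["Das ist ein Test."]
references = [["Dies ist ein Test."]]   # sacrebleu expects a list of reference streams

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")

# Hypothetical ASR output vs. ground-truth transcript for a WER-style score.
asr_reference = ["he began a confused complaint against the wizard who had vanished"]
asr_hypothesis = ["he began a confused complaint against the wizard"]
print(f"WER: {jiwer.wer(asr_reference, asr_hypothesis) * 100:.2f}%")
```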