# Model Card for DiVA Llama 3

This is an ablation of our Distilled Voice Assistant (DiVA) model, which accepts both speech and text as input. This ablation is trained using only the distillation loss, as described in the ablations of the paper: https://huggingface.co/papers/2410.02678

Weights and Biases Run: https://wandb.ai/i18nlp/DiVA%20Training%20Runs/runs/8i1dd47i?nw=nwuserheld

## Citation

This is the distillation-only model from https://huggingface.co/papers/2410.02678:

**BibTeX:**

```
@misc{held2024diva,
  author="Held, Will and Zhang, Yanzhe and Ryan, Michael and Shi, Weiyan and Li, Ella and Yang, Diyi",
  title="Distilling an End-to-End Voice Assistant from Speech Recognition Data",
  year="2024",
  publisher="HuggingFace",
}
```

## Table of Contents

- [Model Card for DiVA Llama 3](#model-card-for-diva-llama-3)
- [Citation](#citation)
- [Table of Contents](#table-of-contents)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
  - [Environmental Impact](#environmental-impact)
  - [Hardware](#hardware)
  - [Software](#software)
- [Model Card Authors [optional]](#model-card-authors-optional)
- [Model Card Contact](#model-card-contact)

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

This model was trained on the [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) corpus.


### Training Procedure

This model was trained for 7k gradient steps with a batch size of 512 recordings. The learning rate was warmed up linearly over 70 steps and then decayed linearly from 5e-5 to zero.
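The schedule described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the training code: the function name `learning_rate` is hypothetical, and it assumes the warmup starts from zero, the 70 warmup steps count toward the 7k total, and the decay reaches zero exactly at the final step.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=70, total_steps=7000):
    """Linear warmup to peak_lr, then linear decay to zero.

    Assumptions (not confirmed by the model card): warmup starts at 0,
    warmup steps are part of total_steps, and decay ends at total_steps.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        # Training is over; the schedule has fully decayed.
        return 0.0
    # Decay phase: ramp linearly from peak_lr down to 0.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


# Total recordings processed: gradient steps times batch size.
total_recordings = 7000 * 512
```

Under these assumptions, the model processes 7,000 × 512 = 3,584,000 recordings over the run (counting repeats, if any).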

### Environmental Impact

- **Hardware Type:** V4-32 TPU
- **Hours used:** 8 hours
- **Cloud Provider:** Google Cloud
- **Compute Region:** US Central C


### Hardware

This model was trained on a V4 TPU on Google Cloud.

### Software

This model was trained with [Levanter](https://github.com/stanford-crfm/levanter).


## Model Card Authors [optional]

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

Will Held

## Model Card Contact

[email protected]