Model Card for Diva Llama 3

This is an end-to-end voice assistant model that accepts both speech and text as input. It was trained with a distillation loss. More details will be in a paper [COMING SOON]!

See the model in action compared to SALMONN and Qwen-Audio at value-nlp.github.io/DiVA-Demo.

Citation

There is no publication yet. If you use this model, please cite the BibTeX entry below:

    @misc{held2024diva,
      author="Held, Will and Zhang, Yanzhe and Ryan, Michael and Shi, Weiyan and Li, Ella and Yang, Diyi",
      title="Distilling an End-to-End Voice Assistant from Speech Recognition Data",
      year="2024",
      publisher="HuggingFace",
    }
    

Training Details

Training Data

This model was trained on the CommonVoice corpus.

Training Procedure

This model was trained for 7k gradient steps with a batch size of 512 recordings. The learning rate decayed linearly from 5e-5 to zero, after a linear warmup of 70 steps.
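The schedule described above can be sketched in a few lines of plain Python. This is an illustration only, not the Levanter configuration used for training. It assumes that warmup starts from a learning rate of zero, that the 70 warmup steps count toward the 7k total, and that decay begins immediately after warmup; the function and parameter names are hypothetical.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=70, total_steps=7000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    Assumptions (not confirmed by the model card): warmup starts at 0,
    and the warmup steps are included in total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Total recordings processed, counting repeats: steps * batch size.
recordings_seen = 7000 * 512  # 3,584,000
```

Under these assumptions, the learning rate peaks at step 70 and reaches zero at step 7000, and the run processes about 3.58 million recordings in total (with repetition, if the corpus is smaller than that).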

Environmental Impact

  • Hardware Type: V4-32 TPU
  • Hours used: 8
  • Cloud Provider: Google Cloud
  • Compute Region: US Central C

Hardware

This model was trained on a V4 TPU on Google Cloud.

Software

This model was trained with Levanter.

Model Card Authors

Will Held

Model Card Contact

[email protected]

Model size: 2.49B params (Safetensors)
Tensor type: F32