pretrained model for audio emotion classification
Is there any pre-trained model for audio emotion classification? If not, is anyone interested in collaborating with me to build one?
Cc'ing @sanchit-gandhi here
Hey @PranavB! I've tried fine-tuning AST for speech-related tasks and unfortunately the performance is not very good.
My conclusion is that there's too big a domain mismatch between the AST pre-training data (generic audio sounds) and speech. You can see the checkpoint I trained for language identification on the FLEURS dataset here: https://huggingface.co/sanchit-gandhi/ast-fleurs-langid-max-length-2048/tensorboard
Eval accuracy is only 17%...
IMO there's much more promise in fine-tuning Whisper, e.g. on FLEURS I get 88% eval accuracy after just 3 epochs (there's a quick inference sketch after the PR links below): https://huggingface.co/sanchit-gandhi/whisper-medium-fleurs-lang-id
See related PRs here: https://github.com/huggingface/transformers/pull/21754
And here: https://github.com/huggingface/transformers/pull/21756
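If you want to try the linked FLEURS checkpoint before training anything yourself, something along these lines should work. This is just a sketch assuming the generic `audio-classification` pipeline; the audio path is a placeholder for your own 16 kHz speech clip:

```python
# Quick inference sketch: run the linked language-id checkpoint on a local clip.
# "sample.wav" is a placeholder path, not part of the repo.
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="sanchit-gandhi/whisper-medium-fleurs-lang-id",
)

# Returns a list of {"label": ..., "score": ...} dicts, highest score first.
print(classifier("sample.wav"))
```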
I think emotion classification would be cool! You can probably copy over the scripts that I used for Whisper language identification and change the dataset to an emotion classification one (rough sketch below).
I think AST is likely to struggle here since it's pre-trained on generic audio sounds (rather than speech) - I would strongly advocate for using Whisper!
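As a rough starting point (not a verified recipe), here's what the fine-tuning could look like with `WhisperForAudioClassification` and the plain `Trainer` API. The dataset name, column names and hyper-parameters below are placeholders you'd swap for a real emotion dataset (e.g. CREMA-D or RAVDESS); the audio-classification example script in `transformers` is likely the cleaner route in practice.

```python
# Rough sketch only: dataset repo id, splits and hyper-parameters are placeholders.
# Assumes the dataset has an "audio" column and a ClassLabel "label" column.
from datasets import Audio, load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    WhisperFeatureExtractor,
    WhisperForAudioClassification,
)

dataset = load_dataset("your-username/your-emotion-dataset")  # placeholder repo id
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

label_names = dataset["train"].features["label"].names
label2id = {name: i for i, name in enumerate(label_names)}
id2label = {i: name for i, name in enumerate(label_names)}

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
model = WhisperForAudioClassification.from_pretrained(
    "openai/whisper-small",
    num_labels=len(label_names),
    label2id=label2id,
    id2label=id2label,
)

def preprocess(example):
    # Convert raw audio to Whisper's 30-second log-mel spectrogram input.
    audio = example["audio"]
    features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    example["input_features"] = features.input_features[0]
    return example

dataset = dataset.map(preprocess, remove_columns=["audio"])

training_args = TrainingArguments(
    output_dir="whisper-small-emotion",  # placeholder output directory
    per_device_train_batch_size=8,       # placeholder hyper-parameters
    learning_rate=1e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
)
trainer.train()
```

Note that the Whisper feature extractor pads every clip to the model's fixed 30-second input, which is usually fine for emotion datasets since clips tend to be only a few seconds long.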
@sanchit-gandhi thank you for the suggestions. I'll follow your tips.
Hi @sanchit-gandhi,
Could you kindly provide me with a link to your code for Whisper language identification? I believe it would greatly assist my current project, which involves emotion classification, as I am exploring similar concepts and techniques.
Additionally, I have posted a related question in the Hugging Face forum, which I believe aligns with your expertise. Here is the link to my question:
https://discuss.huggingface.co/t/fine-tuning-whisper-for-audio-classification/44735
I would greatly appreciate it if you could take a moment to review it and provide your valuable suggestions.
Thank you so much