pretrained model for audio emotion classification
Is there any pre-trained model for audio emotion classification? If not, is anyone interested in collaborating with me to build one?
Cc'ing @sanchit-gandhi here
Hey @PranavB! I've tried fine-tuning AST for speech-related tasks and unfortunately the performance is not very good.
My conclusion is that there's too big a domain mismatch between the AST pre-training data (generic audio sounds) and speech. You can see the checkpoint I trained for language identification on the FLEURS dataset here: https://huggingface.co/sanchit-gandhi/ast-fleurs-langid-max-length-2048/tensorboard
Eval accuracy is only 17%...
IMO there's much more promise in fine-tuning Whisper, e.g. on FLEURS I get 88% eval accuracy after just 3 epochs (there's a quick inference sketch after the PR links below): https://huggingface.co/sanchit-gandhi/whisper-medium-fleurs-lang-id
See related PRs here: https://github.com/huggingface/transformers/pull/21754
And here: https://github.com/huggingface/transformers/pull/21756
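If you want to try the linked FLEURS checkpoint before training anything yourself, something along these lines should work. This is just a sketch assuming the generic `audio-classification` pipeline; the audio path is a placeholder for your own 16 kHz speech clip:

```python
# Quick inference sketch: run the linked language-id checkpoint on a local clip.
# "sample.wav" is a placeholder path, not part of the repo.
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="sanchit-gandhi/whisper-medium-fleurs-lang-id",
)

# Returns a list of {"label": ..., "score": ...} dicts, highest score first.
print(classifier("sample.wav"))
```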
I think emotion classification would be cool! You can probably copy over the scripts that I used for Whisper language identification and change the dataset to an emotion classification one (rough sketch below).
I think AST is likely to struggle here since it's pre-trained on generic audio sounds (rather than speech) - I would strongly advocate for using Whisper!
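As a rough starting point (not a verified recipe), here's what the fine-tuning could look like with `WhisperForAudioClassification` and the plain `Trainer` API. The dataset name, column names and hyper-parameters below are placeholders you'd swap for a real emotion dataset (e.g. CREMA-D or RAVDESS); the audio-classification example script in `transformers` is likely the cleaner route in practice.

```python
# Rough sketch only: dataset repo id, splits and hyper-parameters are placeholders.
# Assumes the dataset has an "audio" column and a ClassLabel "label" column.
from datasets import Audio, load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    WhisperFeatureExtractor,
    WhisperForAudioClassification,
)

dataset = load_dataset("your-username/your-emotion-dataset")  # placeholder repo id
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

label_names = dataset["train"].features["label"].names
label2id = {name: i for i, name in enumerate(label_names)}
id2label = {i: name for i, name in enumerate(label_names)}

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
model = WhisperForAudioClassification.from_pretrained(
    "openai/whisper-small",
    num_labels=len(label_names),
    label2id=label2id,
    id2label=id2label,
)

def preprocess(example):
    # Convert raw audio to Whisper's 30-second log-mel spectrogram input.
    audio = example["audio"]
    features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    example["input_features"] = features.input_features[0]
    return example

dataset = dataset.map(preprocess, remove_columns=["audio"])

training_args = TrainingArguments(
    output_dir="whisper-small-emotion",  # placeholder output directory
    per_device_train_batch_size=8,       # placeholder hyper-parameters
    learning_rate=1e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
)
trainer.train()
```

Note that the Whisper feature extractor pads every clip to the model's fixed 30-second input, which is usually fine for emotion datasets since clips tend to be only a few seconds long.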
@sanchit-gandhi thank you for the suggestions. I'll follow your tips.
Hi @sanchit-gandhi,
Could you kindly provide me with a link to your code for Whisper language identification? I believe it would greatly assist my current project, which involves emotion classification, as I am exploring similar concepts and techniques.
Additionally, I have posted a related question in the Hugging Face forum, which I believe aligns with your expertise. Here is the link to my question:
https://discuss.huggingface.co/t/fine-tuning-whisper-for-audio-classification/44735
I would greatly appreciate it if you could take a moment to review it and provide your valuable suggestions.
Thank you so much