
Emotion Recognition in English Using RAVDESS and Wav2Vec 2.0

This model extracts emotions from audio recordings. It was trained on RAVDESS (the Ryerson Audio-Visual Database of Emotional Speech and Song), a dataset of English audio recordings. The model recognises six emotions: anger, disgust, fear, happiness, sadness and surprise.

The model recreates the work of a Greek emotion extractor, using a pre-trained Wav2Vec2 model to process the data.

Model Details

Model Description

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
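
In the absence of an official snippet, a minimal sketch is given below. It assumes the checkpoint works with the standard `transformers` audio-classification pipeline (i.e. a Wav2Vec2 model with a sequence-classification head); the audio file path is a hypothetical example.

```python
from transformers import pipeline

# Minimal sketch: assumes the checkpoint is compatible with the standard
# audio-classification pipeline (Wav2Vec2 with a sequence-classification head).
classifier = pipeline(
    "audio-classification",
    model="AreejB/wav2vec2-xlsr-english-speech-emotion-recognition",
)

# "speech_sample.wav" is a hypothetical path to a 16 kHz mono recording.
predictions = classifier("speech_sample.wav")
print(predictions)  # e.g. [{"label": "happiness", "score": 0.87}, ...]
```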

Training Details

Training Data

The RAVDESS dataset was split into training, validation and test sets in a 60/20/20 ratio.
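
The card does not describe how the split was produced; the sketch below shows one way to obtain a 60/20/20 split with the `datasets` library, assuming the recordings are loaded from a local audio folder and that the split was random (the seed is an arbitrary choice for illustration).

```python
from datasets import load_dataset

# Assumption: the RAVDESS audio files are available locally as an audio folder.
ravdess = load_dataset("audiofolder", data_dir="path/to/ravdess")["train"]

# 60/20/20 split; the seed is purely illustrative.
split = ravdess.train_test_split(test_size=0.4, seed=42)
train_ds = split["train"]                              # 60%
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
val_ds, test_ds = holdout["train"], holdout["test"]    # 20% / 20%
```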

Training Procedure

The fine-tuning process was centred on four hyper-parameters:

  • the batch size (4, 8),
  • gradient accumulation steps (GAS) (2, 4, 6, 8),
  • number of epochs (10, 20) and
  • the learning rate (1e-3, 1e-4, 1e-5).

Each experiment was repeated 10 times.
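
The original training script is not part of this card; the sketch below illustrates one way such a grid could be run with `transformers.TrainingArguments` (the output-directory naming and all variable names are illustrative assumptions).

```python
from itertools import product
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyper-parameter grid described above.
batch_sizes    = [4, 8]
grad_accum     = [2, 4, 6, 8]
num_epochs     = [10, 20]
learning_rates = [1e-3, 1e-4, 1e-5]

for bs, gas, ep, lr in product(batch_sizes, grad_accum, num_epochs, learning_rates):
    args = TrainingArguments(
        output_dir=f"runs/bs{bs}-gas{gas}-ep{ep}-lr{lr}",
        per_device_train_batch_size=bs,
        gradient_accumulation_steps=gas,
        num_train_epochs=ep,
        learning_rate=lr,
    )
    # Build a Trainer with `args` here and repeat each configuration 10 times.
```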

Evaluation

The hyper-parameter set giving the best performance was: a batch size of 4, 4 gradient accumulation steps, 10 epochs and a learning rate of 1e-4.

Testing

The model was retrained on the combined training and validation sets using the best hyper-parameter set. On the test set, it achieves average accuracy and F1 scores of 84.84% (SD 2 and 2.08, respectively).
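
For reference, a short sketch of how accuracy and macro-averaged F1 can be computed with scikit-learn; the labels below are toy stand-ins for the real test-set labels and model predictions.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins for the real test-set labels and model predictions.
y_true = ["anger", "fear", "happiness", "sadness"]
y_pred = ["anger", "fear", "sadness", "happiness"]

print(accuracy_score(y_true, y_pred))             # overall accuracy
print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1
```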

Results

We retained the model providing the highest performance over the 10 runs.

| Emotion   | Accuracy | Precision | Recall | F1    |
|-----------|----------|-----------|--------|-------|
| Anger     |          | 96.55     | 87.50  |       |
| Disgust   |          | 90.91     | 93.75  |       |
| Fear      |          | 96.30     | 81.25  |       |
| Happiness |          | 93.10     | 84.38  |       |
| Sadness   |          | 81.58     | 96.88  |       |
| Surprise  |          | 77.78     | 87.50  |       |
| Total     | 88.54    | 89.37     | 88.54  | 88.62 |