---
datasets:
- narad/ravdess
language:
- en
metrics:
- f1
- accuracy
- recall
- precision
pipeline_tag: audio-classification
---
# Emotion Recognition in English Using RAVDESS and Wav2Vec 2.0
This model classifies the emotion expressed in English speech. It was trained on RAVDESS, a dataset of English audio recordings, and recognises six emotions: anger, disgust, fear, happiness, sadness and surprise.
The model recreates the work of this Greek emotion recognition model, using a pre-trained Wav2Vec2 model to process the audio.
## Model Details
### Model Description
- Adapted from: Emotion Recognition in Greek
- Model type: NN with CTC
- Language(s) (NLP): English
- Finetuned from model: wav2vec2
## How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
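Until full usage instructions are added, the snippet below is a minimal sketch using the transformers audio-classification pipeline; the repository id and the audio path are placeholders, not values taken from this card.

```python
from transformers import pipeline

# Placeholder repository id; replace with the actual model id on the Hub.
classifier = pipeline(
    "audio-classification",
    model="username/wav2vec2-ravdess-emotion",
)

# Classify a local recording; the pipeline returns a list of
# {"label": ..., "score": ...} dicts sorted by score.
predictions = classifier("path/to/recording.wav")
print(predictions)
```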
## Training Details
### Training Data
The RAVDESS dataset was split into training (60%), validation (20%) and test (20%) sets.
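As a rough illustration of that split, here is a sketch using the datasets library; the split name and the seed are assumptions, not values from this card.

```python
from datasets import load_dataset

# Load the dataset referenced in the metadata above.
ds = load_dataset("narad/ravdess", split="train")

# 60/20/20 split: carve off 40% first, then halve that portion into validation and test.
first = ds.train_test_split(test_size=0.4, seed=42)
second = first["test"].train_test_split(test_size=0.5, seed=42)

train_ds = first["train"]   # 60%
val_ds = second["train"]    # 20%
test_ds = second["test"]    # 20%
```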
### Training Procedure
The fine-tuning process was centred on four hyper-parameters:
- the batch size (4, 8),
- the number of gradient accumulation steps (GAS) (2, 4, 6, 8),
- the number of epochs (10, 20) and
- the learning rate (1e-3, 1e-4, 1e-5).
Each experiment was repeated 10 times.
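For concreteness, the grid below enumerates those configurations; the key names follow the transformers TrainingArguments convention, which is an assumption about how the runs were configured, and the training loop itself is omitted.

```python
from itertools import product

batch_sizes = [4, 8]
grad_accum_steps = [2, 4, 6, 8]
num_epochs = [10, 20]
learning_rates = [1e-3, 1e-4, 1e-5]
n_repeats = 10  # each configuration was repeated 10 times

# Build one config dict per grid point (48 in total, i.e. 480 runs including repeats).
configs = [
    {
        "per_device_train_batch_size": bs,
        "gradient_accumulation_steps": gas,
        "num_train_epochs": epochs,
        "learning_rate": lr,
    }
    for bs, gas, epochs, lr in product(batch_sizes, grad_accum_steps, num_epochs, learning_rates)
]
print(len(configs), "configurations,", len(configs) * n_repeats, "runs in total")
```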
## Evaluation
The hyper-parameter set yielding the best performance was: batch size 4, 4 gradient accumulation steps, 10 epochs and a learning rate of 1e-4.
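Expressed as transformers TrainingArguments, the winning configuration would look roughly like the sketch below; output_dir is a placeholder and only the four tuned values come from this card.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",  # placeholder, not the actual output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    learning_rate=1e-4,
)
```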
### Testing
The model was then retrained on the combined training and validation sets using the best hyper-parameter set. Across the 10 runs, the test-set performance averaged 84.84% for both Accuracy and F1 score (standard deviations of 2 and 2.08, respectively).
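The reported metrics can be computed with a helper along these lines; whether macro or weighted averaging was used is an assumption, and y_true / y_pred stand in for the test-set labels and predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for a set of test predictions."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro"
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example with integer labels 0-5 standing in for the six emotions.
y_true = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
y_pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 2, 4, 5]
print(compute_metrics(y_true, y_pred))
```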
### Results
We retained the model providing the highest performance over the 10 runs.
| Emotion | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Anger | | 96.55 | 87.50 | |
| Disgust | | 90.91 | 93.75 | |
| Fear | | 96.30 | 81.25 | |
| Happiness | | 93.10 | 84.38 | |
| Sadness | | 81.58 | 96.88 | |
| Surprise | | 77.78 | 87.50 | |
| Total | 88.54 | 89.37 | 88.54 | 88.62 |