Overfitting in VideoMAE Model Fine-Tuning for Binary Classification on Home Camera Footage
Description:
I'm fine-tuning a VideoMAE model for binary classification on home camera footage to distinguish between two actions. Here’s a summary of my setup and the issues I’m facing:
Dataset & Variations:
I have two primary datasets:
- Small Dataset: ~120 clips for quicker iteration.
- Full Dataset: ~3k clips.
All videos are 6 seconds long, though I've also tested with 3-second clips.
I've also created variations with blurred or blacked-out backgrounds so the model focuses on the person rather than the scene.
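For illustration, the blacked-out variants amount to a per-frame mask operation along these lines. This is only a sketch: the `person_mask` (a binary segmentation of the person) is assumed to come from elsewhere, e.g. a segmentation model, and the function name is my own.

```python
import numpy as np

def black_out_background(frame: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """Keep pixels inside the person mask, zero everything else.

    frame:       (H, W, 3) uint8 image
    person_mask: (H, W) boolean array, True where the person is
    """
    # Broadcast the mask across the channel axis; True -> 1, False -> 0.
    return frame * person_mask[..., None].astype(frame.dtype)
```

The blurred variant is the same idea, except the masked-out region is replaced with a blurred copy of the frame instead of zeros.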
Model & Configuration:
The model classifies actions using 16 uniformly sampled frames per video.
I’ve tried various base models, including small, base, and large variants, as well as checkpoints fine-tuned on SSV2 and Kinetics.
Hyperparameters tested:
- Batch sizes of 2, 4, and 8.
- Epochs ranging from 4 to 16.
- Learning rate fixed at 5e-5.
I removed the RandomCrop transformation since it often crops the person out of the frame entirely.
I'm using the Hugging Face Video Classification Colab Notebook as a starting point: Training Notebook.
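To make the frame-sampling step concrete, picking 16 uniformly spaced frames per clip comes down to index logic like the following. This mirrors the approach in the Hugging Face notebook, but the helper name here is my own.

```python
import numpy as np

def uniform_frame_indices(total_frames: int, clip_len: int = 16) -> np.ndarray:
    """Pick `clip_len` evenly spaced frame indices covering the whole clip.

    A 6-second clip at 30 fps has 180 frames, so roughly every 12th frame
    is kept; a 3-second clip keeps roughly every 6th frame.
    """
    return np.linspace(0, total_frames - 1, num=clip_len).round().astype(np.int64)
```

For example, `uniform_frame_indices(180)` always includes the first frame (index 0) and the last frame (index 179), so both clip lengths see the full temporal extent of the action.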
Problem: Despite all these variations, the model overfits almost immediately. To rule out dataset-specific issues, I also trained on UCF101 and got results comparable to the Hugging Face VideoMAE Colab, so the training code itself seems fine.
Request: Any advice on addressing this overfitting issue would be greatly appreciated. Specifically, I'm looking for guidance on:
- Additional hyperparameter adjustments.
- Potential model architecture changes (if applicable).
- Dataset augmentation techniques that might improve generalization.
Thank you for any help or insights you can provide!