Overfitting in VideoMAE Model Fine-Tuning for Binary Classification on Home Camera Footage
Description:
I'm fine-tuning a VideoMAE model for binary classification on home camera footage to distinguish between two actions. Here’s a summary of my setup and the issues I’m facing:
Dataset & Variations:
I have two primary datasets:
- Small Dataset: ~120 clips for quicker iteration.
- Full Dataset: ~3k clips.
All videos are 6 seconds long, though I've also tested with 3-second clips.
I've also created variations with blurred or blacked-out backgrounds so the model focuses on the person rather than the scene.
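For illustration, the blacked-out variants amount to a per-frame mask operation along these lines. This is only a sketch: the `person_mask` (a binary segmentation of the person) is assumed to come from elsewhere, e.g. a segmentation model, and the function name is my own.

```python
import numpy as np

def black_out_background(frame: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    """Keep pixels inside the person mask, zero everything else.

    frame:       (H, W, 3) uint8 image
    person_mask: (H, W) boolean array, True where the person is
    """
    # Broadcast the mask across the channel axis; True -> 1, False -> 0.
    return frame * person_mask[..., None].astype(frame.dtype)
```

The blurred variant is the same idea, except the masked-out region is replaced with a blurred copy of the frame instead of zeros.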
Model & Configuration:
The model classifies actions using 16 uniformly sampled frames per video.
I’ve tried various base models, including small, base, and large variants, as well as checkpoints fine-tuned on SSV2 and Kinetics.
Hyperparameters tested:
- Batch sizes of 2, 4, and 8.
- Epochs ranging from 4 to 16.
- Learning rate fixed at 5e-5.
I removed the RandomCrop transformation since it often crops the person out of the frame entirely.
I'm using the Hugging Face Video Classification Colab Notebook as a starting point: Training Notebook.
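To make the frame-sampling step concrete, picking 16 uniformly spaced frames per clip comes down to index logic like the following. This mirrors the approach in the Hugging Face notebook, but the helper name here is my own.

```python
import numpy as np

def uniform_frame_indices(total_frames: int, clip_len: int = 16) -> np.ndarray:
    """Pick `clip_len` evenly spaced frame indices covering the whole clip.

    A 6-second clip at 30 fps has 180 frames, so roughly every 12th frame
    is kept; a 3-second clip keeps roughly every 6th frame.
    """
    return np.linspace(0, total_frames - 1, num=clip_len).round().astype(np.int64)
```

For example, `uniform_frame_indices(180)` always includes the first frame (index 0) and the last frame (index 179), so both clip lengths see the full temporal extent of the action.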
Problem: Despite all these variations, the model overfits almost immediately. To rule out dataset-specific issues, I also trained on UCF101 and got results comparable to the Hugging Face VideoMAE Colab, so the training code itself seems fine.
Request: Any advice on addressing this overfitting issue would be greatly appreciated. Specifically, I'm looking for guidance on:
- Additional hyperparameter adjustments.
- Potential model architecture changes (if applicable).
- Dataset augmentation techniques that might improve generalization.
Thank you for any help or insights you can provide!