language: en
license: mit
library_name: transformers
tags:
- video-classification
- videomae
- vision
Model Card for videomae-base-finetuned-ucf101
A WandB report here for metrics.
Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Evaluation
- Model Examination
- Environmental Impact
- Technical Specifications
- Citation
- Glossary
- More Information
- Model Card Authors
- Model Card Contact
- How To Get Started With the Model
Model Details
Model Description
VideoMAE Base model fine tuned on UCF101
- Developed by: @nateraw
- Shared by [optional]: [More Information Needed]
- Model type: fine-tuned
- Language(s) (NLP): en
- License: mit
- Related Models [optional]: [More Information Needed]
- Parent Model [optional]: MCG-NJU/videomae-base
- Resources for more information: [More Information Needed]
Uses
Direct Use
This model can be used for Video Action Recognition
Downstream Use [optional]
[More Information Needed]
Out-of-Scope Use
[More Information Needed]
Bias, Risks, and Limitations
[More Information Needed]
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recomendations.
Training Details
Training Data
[More Information Needed]
Training Procedure [optional]
Preprocessing
We sampled clips from the videos of 64 frames, then took a uniform sample of those frames to get 16 frame inputs for the model. During training, we used PyTorchVideo's MixVideo
to apply mixup/cutmix.
Speeds, Sizes, Times
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
We only trained/evaluated one fold from the UCF101 annotations. Unlike in the VideoMAE paper, we did not perform inference over multiple crops/segments of validation videos, so the results are likely slightly lower than what you would get if you did that too.
- Eval Accuracy: 0.758209764957428
- Eval Accuracy Top 5: 0.8983050584793091
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
Model Card Contact
How to Get Started with the Model
Use the code below to get started with the model.
Click to expand
from decord import VideoReader, cpu
import torch
import numpy as np
from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download
np.random.seed(0)
def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
converted_len = int(clip_len * frame_sample_rate)
end_idx = np.random.randint(converted_len, seg_len)
start_idx = end_idx - converted_len
indices = np.linspace(start_idx, end_idx, num=clip_len)
indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
return indices
# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
repo_id="nateraw/dino-clips", filename="archery.mp4", repo_type="space"
)
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))
# sample 16 frames
videoreader.seek(0)
indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
video = videoreader.get_batch(indices).asnumpy()
feature_extractor = VideoMAEFeatureExtractor.from_pretrained("nateraw/videomae-base-finetuned-ucf101")
model = VideoMAEForVideoClassification.from_pretrained("nateraw/videomae-base-finetuned-ucf101")
inputs = feature_extractor(list(video), return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 101 UCF101 classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])