metadata

language: en
license: mit
library_name: transformers
tags:
  - video-classification
  - videomae
  - vision

Model Card for videomae-base-finetuned-ucf101

Metrics are available in a WandB report here.

Table of Contents

Model Details
Uses
Bias, Risks, and Limitations
Training Details
Evaluation
Model Examination
Environmental Impact
Technical Specifications
Citation
Glossary
More Information
Model Card Authors
Model Card Contact
How to Get Started with the Model

Model Details

Model Description

VideoMAE Base model fine-tuned on UCF101.

Developed by: @nateraw
Shared by [optional]: [More Information Needed]
Model type: fine-tuned
Language(s) (NLP): en
License: mit
Related Models [optional]: [More Information Needed]
- Parent Model [optional]: MCG-NJU/videomae-base
Resources for more information: [More Information Needed]

Uses

Direct Use

This model can be used for video action recognition.

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

Training Details

Training Data

[More Information Needed]

Training Procedure [optional]

Preprocessing

We sampled 64-frame clips from the videos, then uniformly subsampled each clip to 16 frames to form the model input. During training, we used PyTorchVideo's MixVideo to apply mixup/cutmix.
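The 64-to-16 frame subsampling step can be sketched as follows. This is a minimal NumPy illustration, not the training code: the helper names `uniform_subsample_indices` and `subsample_clip` are hypothetical, and truncating the evenly spaced positions to integers is an assumption about how the uniform sample was taken.

```python
import numpy as np


def uniform_subsample_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Return `num_samples` evenly spaced frame indices in [0, num_frames - 1]."""
    # Evenly spaced (possibly fractional) positions, first and last frame included.
    positions = np.linspace(0, num_frames - 1, num=num_samples)
    # Truncate to integer frame indices (assumption: no interpolation between frames).
    return positions.astype(np.int64)


def subsample_clip(clip: np.ndarray, num_samples: int = 16) -> np.ndarray:
    """Subsample a clip shaped (frames, height, width, channels) along the time axis."""
    indices = uniform_subsample_indices(clip.shape[0], num_samples)
    return clip[indices]


# Dummy 64-frame clip standing in for decoded video frames.
clip = np.zeros((64, 224, 224, 3), dtype=np.uint8)
print(subsample_clip(clip).shape)
```

With a 64-frame clip this keeps roughly every fourth frame, which matches the `frame_sample_rate=4` used in the getting-started code of this card.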

Speeds, Sizes, Times

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

We trained and evaluated on only one fold of the UCF101 annotations. Unlike the VideoMAE paper, we did not run inference over multiple crops/segments of each validation video, so the results below are likely slightly lower than what multi-crop/segment inference would give.

Eval Accuracy: 0.758209764957428
Eval Accuracy Top 5: 0.8983050584793091
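For readers who want to reproduce these metrics or try multi-crop/segment inference, the sketch below shows one common way to do it: average class probabilities over several views of each video, then compute top-k accuracy. It is an illustration under assumptions, not the evaluation code behind the numbers above. The function names are hypothetical, and averaging softmax probabilities (rather than logits) is an assumed choice.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def average_views(view_logits: np.ndarray) -> np.ndarray:
    """Average probabilities over views.

    view_logits: (num_videos, num_views, num_classes)
    returns:     (num_videos, num_classes)
    """
    return softmax(view_logits).mean(axis=1)


def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of videos whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(-scores, axis=-1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=-1)
    return float(hits.mean())


# Toy data: 3 videos, 2 views each, 4 classes (random values, for shape only).
rng = np.random.default_rng(0)
view_logits = rng.normal(size=(3, 2, 4))
labels = np.array([0, 1, 2])
scores = average_views(view_logits)
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=2))
```

Single-view evaluation, as used for this card, corresponds to `num_views = 1`.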

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: [More Information Needed]
Hours used: [More Information Needed]
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]
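Once the fields above are known, a rough estimate usually takes the form energy consumed times the carbon intensity of the electricity grid. The sketch below shows only that arithmetic. The function name is hypothetical, the input values are placeholders rather than measurements from this training run, and the calculator itself may account for additional factors.

```python
def estimate_emissions_kg(power_kw: float, hours: float, intensity_kg_per_kwh: float) -> float:
    """Rough CO2-equivalent estimate: energy (kWh) times grid carbon intensity."""
    energy_kwh = power_kw * hours
    return energy_kwh * intensity_kg_per_kwh


# Placeholder inputs, NOT values from this model's training run.
example = estimate_emissions_kg(power_kw=0.3, hours=10.0, intensity_kg_per_kwh=0.4)
print(f"{example:.2f} kg CO2eq")
```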

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

@nateraw

Model Card Contact

@nateraw

How to Get Started with the Model

Use the code below to get started with the model.

from decord import VideoReader, cpu
import torch
import numpy as np

# VideoMAEImageProcessor replaces the deprecated VideoMAEFeatureExtractor
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    """Pick clip_len evenly spaced frame indices from a random window of the video."""
    # the window spans clip_len * frame_sample_rate consecutive frames
    converted_len = int(clip_len * frame_sample_rate)
    # random window end; requires seg_len > converted_len
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    # keep indices inside the window and convert them to integers
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
    repo_id="nateraw/dino-clips", filename="archery.mp4", repo_type="space"
)
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))

# sample 16 frames from a 64-frame window
videoreader.seek(0)
indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
video = videoreader.get_batch(indices).asnumpy()

image_processor = VideoMAEImageProcessor.from_pretrained("nateraw/videomae-base-finetuned-ucf101")
model = VideoMAEForVideoClassification.from_pretrained("nateraw/videomae-base-finetuned-ucf101")

# the processor expects a list of frames, each shaped (height, width, channels)
inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# model predicts one of the 101 UCF101 classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
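
The snippet above prints only the single most likely class. Since this card also reports top-5 accuracy, it can be useful to look at the five highest-scoring labels for a video. The helper below is a hypothetical sketch in NumPy; `top_k_predictions` is not part of `transformers`, and the toy label mapping stands in for `model.config.id2label`.

```python
import numpy as np


def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs for one video."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax over the class axis.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Indices of the k largest probabilities, highest first.
    top = np.argsort(-probs)[:k]
    return [(id2label[int(i)], float(probs[i])) for i in top]


# Toy stand-ins for the model output and label mapping (hypothetical values).
toy_id2label = {0: "Archery", 1: "Bowling", 2: "Diving"}
toy_logits = [2.0, 0.5, -1.0]
for label, prob in top_k_predictions(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, the call would look like `top_k_predictions(logits[0].numpy(), model.config.id2label)`.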
