
We have recently merged Video-LLaVA to @huggingface transformers! 🤗
🎞️ What makes this model different? Keep reading ⇊

video

Demo | Model
See below how to initialize the model and processor and run inference ⬇️

image_1
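
For readers who want something to copy, here is a minimal sketch of those steps in text form. It assumes the `LanguageBind/Video-LLaVA-7B-hf` checkpoint, a Vicuna-style `USER: ... ASSISTANT:` prompt with a `<video>` placeholder, and 8 evenly spaced frames per clip; none of these details come from this post, so check them against the documentation. The helper names (`sample_frame_indices`, `build_prompt`, `describe_video`) are made up for illustration, and `device_map="auto"` additionally assumes `accelerate` is installed.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices from a clip."""
    if total_frames < num_frames:
        raise ValueError("clip has fewer frames than requested")
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def build_prompt(question: str) -> str:
    """Build a Vicuna-style prompt with a <video> placeholder (assumed format)."""
    return f"USER: <video>\n{question} ASSISTANT:"


def describe_video(
    frames,
    question: str,
    model_id: str = "LanguageBind/Video-LLaVA-7B-hf",  # assumed checkpoint name
) -> str:
    """Run Video-LLaVA on RGB frames shaped (num_frames, height, width, 3)."""
    # Heavy imports stay inside the function so the helpers above work without them.
    import torch
    from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

    # Initialize the processor and the model.
    processor = VideoLlavaProcessor.from_pretrained(model_id)
    model = VideoLlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Preprocess the text and frames together, then generate and decode.
    inputs = processor(text=build_prompt(question), videos=frames, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=80)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

To use it, decode a clip with whatever video reader you prefer, keep the frames at `sample_frame_indices(total_frames)`, stack them into one array, and pass that array to `describe_video` along with your question.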

Compared to other models that take image and video input and either project them separately or downsample the video and project selected frames, Video-LLaVA converts images and videos to a unified representation and projects them using a shared projection layer.

image_2
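
To make the shared-projection idea concrete, here is a toy sketch in plain Python. It is not the model's actual code: the encoders are replaced by random vectors, the projection is simplified to a single linear map, and the sizes (`ENC_DIM`, `LLM_DIM`, 4 tokens per frame) are made-up illustration values. The only point is that image tokens and video tokens live in the same feature space, so one set of projection weights serves both.

```python
import random

ENC_DIM, LLM_DIM = 8, 16   # toy sizes, not the real model dimensions
TOKENS_PER_FRAME = 4       # toy number of patch tokens per frame


def fake_encoder(num_tokens: int, seed: int) -> list[list[float]]:
    """Stand-in for a visual encoder: returns random ENC_DIM-sized tokens."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(ENC_DIM)] for _ in range(num_tokens)]


def make_projection(seed: int = 0) -> list[list[float]]:
    """Create one ENC_DIM x LLM_DIM weight matrix (a simplified linear projector)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(LLM_DIM)] for _ in range(ENC_DIM)]


def project(tokens: list[list[float]], weight: list[list[float]]) -> list[list[float]]:
    """Map every token from the encoder space into the language-model space."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


# A single projector shared by both modalities.
shared_weight = make_projection()

image_tokens = fake_encoder(TOKENS_PER_FRAME, seed=1)       # one image
video_tokens = fake_encoder(8 * TOKENS_PER_FRAME, seed=2)   # 8 frames of a video

image_embeds = project(image_tokens, shared_weight)
video_embeds = project(video_tokens, shared_weight)

print(len(image_embeds), len(image_embeds[0]))
print(len(video_embeds), len(video_embeds[0]))
```

Both outputs end up with the same embedding width, so the language model can consume image and video tokens in one sequence without a separate projector per modality.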

It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP. These encoders project the modalities to a unified representation before passing them to the projection layer.

image_3

I feel like one of the coolest features of this model is its joint understanding of images and videos, which many recent models have also introduced. It's a relatively older model, but it was ahead of its time and works very well!

image_4

Resources:
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan (2023)
GitHub
Hugging Face documentation

Original tweet (July 25, 2024)