Large Language and Video Assistant
Collection
Video LLM based on Llama-3 and Llama-3.1
•
4 items
•
Updated
Please follow my github repo LLaVA-Unified for more details on fine-tuning LLaVA model with Llama-3 as the foundatiaon LLM.
Please firstly install llava via
pip install git+https://github.com/Victorwz/LLaVA-Unified.git
You can load the model and perform inference as follows:
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from PIL import Image
import requests
import cv2
import torch
import base64
import io
from io import BytesIO
import numpy as np
# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/LLaVA-Video-Llama-3")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Video-Llama-3", None, model_name, False, False, device=device)
# prepare image input
url = "https://github.com/PKU-YuanGroup/Video-LLaVA/raw/main/videollava/serve/examples/sample_demo_1.mp4"
def read_video(video_url):
response = requests.get(url)
if response.status_code != 200:
print("Failed to download video")
exit()
else:
with open("tmp_video.mp4", 'wb') as f:
for chunk in response.iter_content(chunk_size=1024):
f.write(chunk)
video = cv2.VideoCapture("tmp_video.mp4")
base64Frames = []
while video.isOpened():
success, frame = video.read()
if not success:
break
_, buffer = cv2.imencode(".jpg", frame)
base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
video.release()
print(len(base64Frames), "frames read.")
return base64Frames
video_frames = read_video(video_url=url)
image_tensors = []
samplng_interval = int(len(video_frames) / 10)
for i in range(0, len(video_frames), samplng_interval):
rawbytes = base64.b64decode(video_frames[i])
image = Image.open(io.BytesIO(rawbytes)).convert("RGB")
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0].half().cuda()
image_tensors.append(image_tensor)
# prepare inputs for the model
text = "\n".join(['<image>' for i in range(len(image_tensors))]) + '\n' + "Why is this video funny"
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()
# autoregressively generate text
with torch.inference_mode():
output_ids = model.generate(
input_ids,
images=image_tensors,
do_sample=False,
max_new_tokens=512,
use_cache=True)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(outputs[0])
The image caption results look like:
The video is funny because it shows a baby girl wearing glasses and reading a book, which is an unusual and amusing sight. It is not common to see a baby wearing glasses and engaging in a reading activity, as they are still developing their motor skills and cognitive abilities. The image captures a cute and endearing moment, as the baby appears to be enjoying her time and learning to read. This scene can evoke a sense of warmth and delight in the viewer, as it showcases the innocence and curiosity of childhood.
Please refer to our LLaVA-Unified git repo for fine-tuning data preparation and scripts. The data loading function and fastchat conversation template are changed due to a different tokenizer.
@misc{wang2024llavavideollama3,
title={LLaVA-Video-Llama-3: A Video Understanding Multimodal LLM based on Llama-3-8B LLM backbone},
author={Wang, Weizhi},
year={2024}
}