Video Input
I couldn't find a good resource on how to use it with video input, either for fine-tuning or even for simple inference.
I noticed from the paper that some video tasks are also benchmarked.
Can somebody help please?
Hi @scdrand23, I am interested in this too. Have you found video support in HF's PaliGemma suite?
Not official HF support, but I found that you can just pass the video as a list of frames after converting it into frame chunks:
model_inputs = processor(text=prompt, images=frames, return_tensors="pt")
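To build that list of frames, one common approach is to sample a fixed number of evenly spaced frames from the clip before decoding them. Here is a minimal sketch of such a sampler; `sample_frame_indices` is a hypothetical helper, not part of transformers, and you would use the returned indices with whatever video decoder you prefer (e.g. decord or OpenCV):

```python
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int) -> list[int]:
    """Pick clip_len evenly spaced frame indices out of num_frames total frames."""
    # linspace spans the full clip; rounding gives integer frame positions
    idx = np.linspace(0, num_frames - 1, clip_len).round().astype(int)
    return idx.tolist()

# e.g. for a 100-frame video, take 4 frames spread across its length
indices = sample_frame_indices(100, 4)
print(indices)  # [0, 33, 66, 99]
```

The decoded frames (as PIL images or arrays) then go into `processor(text=prompt, images=frames, return_tensors="pt")` as shown above.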
Thanks for the speedy reply. I stepped through what you described. Assuming a batch size of 1, in the _merge_input_ids_with_image_features function, the final embedding only absorbs the first frame, right? https://github.com/huggingface/transformers/blob/88e0813d8dde26b43a427c5d1a519f0e6ce3392f/src/transformers/models/paligemma/modeling_paligemma.py#L311-L314
final_embedding = final_embedding.masked_scatter(
image_mask.unsqueeze(-1).expand_as(final_embedding).to(device=final_embedding.device),
scaled_image_features.to(device=final_embedding.device, dtype=final_embedding.dtype),
)
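If I understand the concern, it hinges on how `masked_scatter` behaves when the source has more elements than the mask has `True` positions: it fills the masked slots in order and silently ignores the rest. So if the mask only covers one frame's worth of image tokens, features from later frames would be dropped. A toy example (plain PyTorch, not the PaliGemma code itself) to illustrate that semantics:

```python
import torch

# 6 embedding slots, but the mask marks only 2 of them as image-token positions
emb = torch.zeros(6)
mask = torch.tensor([True, True, False, False, False, False])

# pretend we have 4 image features (e.g. patches from multiple frames)
src = torch.tensor([10.0, 20.0, 30.0, 40.0])

# masked_scatter writes src elements into the True positions, in order;
# the surplus elements (30.0, 40.0) are discarded without error
out = emb.masked_scatter(mask, src)
print(out)  # tensor([10., 20.,  0.,  0.,  0.,  0.])
```

Whether extra frames are actually dropped in practice depends on how many `<image>` placeholder tokens the processor inserts into the prompt, since those determine the number of `True` positions in `image_mask`.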