Video Input
I couldn't find a good resource on how to use it with video input, either for fine-tuning or even for simple inference.
I noticed from the paper that some video tasks are also benchmarked.
Can somebody help please?
Hi @scdrand23, I am interested in this too. Have you found video support in HF's PaliGemma suite?
Not official HF support, but I found that you can just pass the video as a list of frames after converting it into frame chunks:
model_inputs = processor(text=prompt, images=frames, return_tensors="pt")
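To build that list of frames, one common approach is to sample a fixed number of evenly spaced frames from the clip before decoding them. Here is a minimal sketch of such a sampler; `sample_frame_indices` is a hypothetical helper, not part of transformers, and you would use the returned indices with whatever video decoder you prefer (e.g. decord or OpenCV):

```python
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int) -> list[int]:
    """Pick clip_len evenly spaced frame indices out of num_frames total frames."""
    # linspace spans the full clip; rounding gives integer frame positions
    idx = np.linspace(0, num_frames - 1, clip_len).round().astype(int)
    return idx.tolist()

# e.g. for a 100-frame video, take 4 frames spread across its length
indices = sample_frame_indices(100, 4)
print(indices)  # [0, 33, 66, 99]
```

The decoded frames (as PIL images or arrays) then go into `processor(text=prompt, images=frames, return_tensors="pt")` as shown above.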
Thanks for the speedy reply. I stepped through what you described. Assuming a batch size of 1, in the _merge_input_ids_with_image_features function, the final embedding only absorbs the first frame, right? https://github.com/huggingface/transformers/blob/88e0813d8dde26b43a427c5d1a519f0e6ce3392f/src/transformers/models/paligemma/modeling_paligemma.py#L311-L314
final_embedding = final_embedding.masked_scatter(
image_mask.unsqueeze(-1).expand_as(final_embedding).to(device=final_embedding.device),
scaled_image_features.to(device=final_embedding.device, dtype=final_embedding.dtype),
)
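If I understand the concern, it hinges on how `masked_scatter` behaves when the source has more elements than the mask has `True` positions: it fills the masked slots in order and silently ignores the rest. So if the mask only covers one frame's worth of image tokens, features from later frames would be dropped. A toy example (plain PyTorch, not the PaliGemma code itself) to illustrate that semantics:

```python
import torch

# 6 embedding slots, but the mask marks only 2 of them as image-token positions
emb = torch.zeros(6)
mask = torch.tensor([True, True, False, False, False, False])

# pretend we have 4 image features (e.g. patches from multiple frames)
src = torch.tensor([10.0, 20.0, 30.0, 40.0])

# masked_scatter writes src elements into the True positions, in order;
# the surplus elements (30.0, 40.0) are discarded without error
out = emb.masked_scatter(mask, src)
print(out)  # tensor([10., 20.,  0.,  0.,  0.,  0.])
```

Whether extra frames are actually dropped in practice depends on how many `<image>` placeholder tokens the processor inserts into the prompt, since those determine the number of `True` positions in `image_mask`.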