Does the Llama-3.2 Vision model support multiple images?

#43

by JOJOHuang - opened Sep 29

Discussion

JOJOHuang

Sep 29

Does this model support multiple images? If so, would it be used like this?

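import requests
from PIL import Image

# Image.open expects a file path or file object, so fetch each URL first
# (assuming url1 and url2 are HTTP URLs).
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

# One {"type": "image"} placeholder per image, followed by the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "please describe these two images"}
    ]}
]
# processor and model are assumed to be loaded already.
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor([image1, image2], input_text, return_tensors="pt").to(model.device)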

Sanyam

Meta Llama org Sep 29

Thanks for the Q! We recommend using a single image for inference; the model doesn't work reliably with multiple images.

JOJOHuang

Sep 30

OK, thanks for your reply!
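
Following the single-image recommendation, one way to handle two images is to run a separate generation for each. The sketch below is only an illustration: single_image_messages and describe_separately are hypothetical helper names, the prompt and max_new_tokens values are placeholders, and processor and model are assumed to be an already loaded Llama-3.2 Vision processor and model.

```python
def single_image_messages(prompt):
    """Build a chat message list with exactly one image placeholder."""
    return [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ]}
    ]


def describe_separately(images, processor, model,
                        prompt="please describe this image",
                        max_new_tokens=128):
    """Run one generation per image instead of passing all images at once.

    processor and model are assumed to be loaded elsewhere.
    """
    descriptions = []
    for image in images:
        input_text = processor.apply_chat_template(
            single_image_messages(prompt), add_generation_prompt=True
        )
        inputs = processor(image, input_text, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # The decoded string contains the prompt followed by the model's answer.
        descriptions.append(processor.decode(output[0], skip_special_tokens=True))
    return descriptions
```

Calling describe_separately([image1, image2], processor, model) would return one description per image. Comparing the two images would then have to happen outside the model, for example in a follow-up text-only prompt.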

Sanyam changed discussion status to closed Oct 3

h3045

Oct 16

Hey Sanyam,

Thanks for the response.

Any idea why this is happening?

Is it a limitation of the model size or the lack of training?

What I understood from the documentation was that the model was trained with videos, so I was curious why it is not performant on multiple images.

danmir

Oct 22

I am getting a CUDA out-of-memory message when I use multiple images.

sraliu

27 days ago

I have the same question: can this model run inference on video files? For example, by using cv2 to extract a set of frames?
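
For reference, a minimal sketch of the frame-extraction step might look like the following. It only covers pulling evenly spaced frames out of a video with cv2. Whether the model handles such frames well is not confirmed anywhere in this thread. The names sample_frame_indices and extract_frames and the default of 4 frames are hypothetical choices for illustration.

```python
def sample_frame_indices(total_frames, num_samples):
    """Return up to num_samples evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def extract_frames(video_path, num_samples=4):
    """Read evenly spaced frames from a video and return them as PIL images."""
    import cv2  # imported lazily so the index helper works without OpenCV
    from PIL import Image

    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for index in sample_frame_indices(total, num_samples):
        # Seek to the requested frame, then decode it.
        capture.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = capture.read()
        if not ok:
            continue  # skip frames that fail to decode
        # OpenCV decodes to BGR; PIL expects RGB.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    capture.release()
    return frames
```

Given the single-image recommendation earlier in the thread, each returned frame would probably have to be sent to the model in its own request rather than all together.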
