How can I generate text longer than a single word?

by AhmadMujtaba200210 - opened Dec 6, 2023

AhmadMujtaba200210

Dec 6, 2023

I am using this model, but I cannot get it to generate a response longer than a single word. For example,
when I ask it to describe this picture, it answers "No".

Here is the code:

import requests
from PIL import Image
import matplotlib.pyplot as plt
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Display the image

plt.imshow(raw_image)
plt.axis('off')
plt.show()

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)

# Decode and print every possible output

for i in range(out.shape[0]):
    answer = processor.decode(out[i], skip_special_tokens=True)
    print(f"Answer {i + 1}: {answer}")

ybelkada

Dec 7, 2023

Hi @AhmadMujtaba200210
Thanks for the issue,

I think this is expected: as far as I understand, this model was trained to generate short answers (a word or a short phrase), not full sentences.

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
>>> 1

To generate longer sentences you can either use the image captioning models, or use other architectures, such as llava: https://huggingface.co/llava-hf
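As a rough sketch of the image captioning route, something like the following might work. It assumes the `Salesforce/blip-image-captioning-base` checkpoint and the `BlipForConditionalGeneration` class from transformers; the helper name `caption_image` and its default values are made up for illustration and are not part of the library.

```python
CAPTION_CHECKPOINT = "Salesforce/blip-image-captioning-base"  # assumed checkpoint


def caption_image(image, prompt=None, max_new_tokens=40, device="cpu"):
    """Return a caption for a PIL image (hypothetical helper, not a library API).

    `prompt` is an optional text prefix that the caption should continue,
    e.g. "a photography of". `max_new_tokens` only sets an upper bound;
    the model may stop earlier.
    """
    # Heavy import kept inside the function so loading this snippet stays cheap.
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained(CAPTION_CHECKPOINT)
    model = BlipForConditionalGeneration.from_pretrained(CAPTION_CHECKPOINT).to(device)

    # With text=None the processor returns only pixel values (unconditional captioning).
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)
```

With the `raw_image` loaded as in the snippet above, `print(caption_image(raw_image, device="cuda"))` should give a sentence-style caption instead of a one-word answer. Note that a captioning model describes the image; it does not answer arbitrary questions about it.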

lzymok

5 days ago

Is there any way to solve this problem?
