How can I generate text longer than a single word?

#8
by AhmadMujtaba200210 - opened

I am using this model, but I cannot get it to generate a response longer than a single word. For example,
when I ask it to describe this picture, it responds with "No".

Here is the code:

import requests
from PIL import Image
import matplotlib.pyplot as plt
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Display the image

plt.imshow(raw_image)
plt.axis('off')
plt.show()

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)

# Decode and print every possible output

for i in range(out.shape[0]):
    answer = processor.decode(out[i], skip_special_tokens=True)
    print(f"Answer {i + 1}: {answer}")

Hi @AhmadMujtaba200210
Thanks for the issue,

I think this is expected. As far as I understand, this model is trained to generate short outputs (short answers or sentences).

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
>>> 1

To generate longer sentences, you can either use an image captioning model or use another architecture, such as LLaVA: https://huggingface.co/llava-hf
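As a rough illustration of the image captioning route, here is a minimal sketch. It assumes the "Salesforce/blip-image-captioning-base" checkpoint, which is not mentioned in this thread and is only one possible choice. The helper names caption_image and load_demo_image are made up for this example, and the max_new_tokens value of 50 is arbitrary.

```python
# Assumption: a BLIP captioning checkpoint is used instead of the VQA one.
DEFAULT_CAPTION_MODEL = "Salesforce/blip-image-captioning-base"
DEMO_IMAGE_URL = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"


def load_demo_image(url=DEMO_IMAGE_URL):
    """Download the demo image used earlier in this thread (hypothetical helper)."""
    import requests
    from PIL import Image

    return Image.open(requests.get(url, stream=True).raw).convert("RGB")


def caption_image(image, prompt=None, max_new_tokens=50, model_name=DEFAULT_CAPTION_MODEL):
    """Return a caption for a PIL image (hypothetical helper).

    If `prompt` is given, it is used as the beginning of the caption
    (conditional captioning); otherwise the caption is generated from
    the image alone.
    """
    # Imported here so the heavy dependencies are only needed when the
    # function is actually called.
    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration

    # Fall back to CPU when no GPU is available.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

    if prompt is None:
        inputs = processor(image, return_tensors="pt").to(device)
    else:
        inputs = processor(image, prompt, return_tensors="pt").to(device)

    # max_new_tokens caps the caption length; 50 is an arbitrary choice.
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)


# Example calls (the model weights are downloaded on first use):
# print(caption_image(load_demo_image()))
# print(caption_image(load_demo_image(), prompt="a photography of"))
```

A captioning model describes the image as a whole rather than answering a specific question, so it fits a request like "describe this picture" better than the VQA checkpoint does.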

Are there any ways to solve this problem?
