Fine-Tuning for VQA
I am attempting to fine-tune the Florence-2 model for a Visual Question Answering (VQA) task on a dataset of medical images, but I am not getting consistent results with question-answer pairs alone. The goal is to be able to ask questions such as whether a specific object is present in the image or missing from it.
What kind of dataset structure should I create for this? Should it include location information such as bounding boxes (bbox)?
In my current approach, I plan to train on a combination of two different task prompts: a question-answer prompt and a label+bbox (detection-style) prompt.
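One thing to keep in mind for the label+bbox prompt is that Florence-2 does not consume raw pixel coordinates: boxes are serialized into quantized location tokens (`<loc_0>` … `<loc_999>`). Below is a minimal sketch of that serialization for building training targets; the exact rounding behavior is my assumption and may differ slightly from the official processor:

```python
def encode_box(box, width, height):
    """Map an (x1, y1, x2, y2) pixel box to Florence-2-style <loc_###>
    tokens, quantized into 1000 bins relative to the image size.
    The integer-floor rounding here is an assumption."""
    x1, y1, x2, y2 = box
    bins = [
        min(999, x1 * 1000 // width),
        min(999, y1 * 1000 // height),
        min(999, x2 * 1000 // width),
        min(999, y2 * 1000 // height),
    ]
    return "".join(f"<loc_{b}>" for b in bins)

# e.g. on a 1000x1000 image:
# encode_box((100, 200, 300, 400), 1000, 1000)
#   -> "<loc_100><loc_200><loc_300><loc_400>"
```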
For context, here is the structure I am considering:
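Roughly, I have something like the following in mind, where each sample is an (image, prompt prefix, target suffix) triple. The file names and the VQA prefix format are placeholders on my part; `<OD>` is one of Florence-2's standard task prompts, and boxes are serialized as quantized `<loc_###>` tokens:

```python
# Hypothetical combined dataset: question-answer samples and
# detection-style samples mixed in one training set.
records = [
    {   # question-answer sample
        "image": "case_001.png",
        "prefix": "<VQA> Is there a catheter in the image?",
        "suffix": "yes",
    },
    {   # label+bbox sample for the same image
        "image": "case_001.png",
        "prefix": "<OD>",
        "suffix": "catheter<loc_120><loc_340><loc_210><loc_480>",
    },
]
```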
I have also tried training on each prompt individually, using only question-answer pairs or only the label+bbox format, but did not achieve better results. I additionally tested several other structural variations without success.
I am looking for advice on how to structure the dataset effectively. Any suggestions or insights on improving my approach would be greatly appreciated.