Fine-Tuning for VQA
I am attempting to fine-tune the Florence-2 model for a Visual Question Answering (VQA) task on a dataset of medical images, but I am not getting consistent results with question-answer pairs alone. The goal is to be able to ask questions such as whether a specific object is present in the image or missing from it.
What kind of dataset structure should I create for this? Should it include location information such as bounding boxes (bbox)?
In my current approach, I plan to train on a combination of two different task prompts: a question-answer prompt and a label+bbox (detection-style) prompt.
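One thing to keep in mind for the label+bbox prompt is that Florence-2 does not consume raw pixel coordinates: boxes are serialized into quantized location tokens (`<loc_0>` … `<loc_999>`). Below is a minimal sketch of that serialization for building training targets; the exact rounding behavior is my assumption and may differ slightly from the official processor:

```python
def encode_box(box, width, height):
    """Map an (x1, y1, x2, y2) pixel box to Florence-2-style <loc_###>
    tokens, quantized into 1000 bins relative to the image size.
    The integer-floor rounding here is an assumption."""
    x1, y1, x2, y2 = box
    bins = [
        min(999, x1 * 1000 // width),
        min(999, y1 * 1000 // height),
        min(999, x2 * 1000 // width),
        min(999, y2 * 1000 // height),
    ]
    return "".join(f"<loc_{b}>" for b in bins)

# e.g. on a 1000x1000 image:
# encode_box((100, 200, 300, 400), 1000, 1000)
#   -> "<loc_100><loc_200><loc_300><loc_400>"
```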
For context, here is the structure I am considering:
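Roughly, I have something like the following in mind, where each sample is an (image, prompt prefix, target suffix) triple. The file names and the VQA prefix format are placeholders on my part; `<OD>` is one of Florence-2's standard task prompts, and boxes are serialized as quantized `<loc_###>` tokens:

```python
# Hypothetical combined dataset: question-answer samples and
# detection-style samples mixed in one training set.
records = [
    {   # question-answer sample
        "image": "case_001.png",
        "prefix": "<VQA> Is there a catheter in the image?",
        "suffix": "yes",
    },
    {   # label+bbox sample for the same image
        "image": "case_001.png",
        "prefix": "<OD>",
        "suffix": "catheter<loc_120><loc_340><loc_210><loc_480>",
    },
]
```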
I have also tried training on each prompt individually, using only question-answer pairs or only the label+bbox format, but did not achieve better results. I additionally tested several other structural variations without success.
I am looking for advice on how to structure the dataset effectively. Any suggestions or insights on improving my approach would be greatly appreciated.