Links for Reference
- Repository: https://github.com/kaistAI/Volcano
- Paper: https://arxiv.org/abs/2311.07362
Overview
Volcano uses a single large multimodal model (LMM) to generate the initial response, the feedback, the revision, and the decision on whether to accept the revision. These steps run sequentially in an iterative critique-revision-decide loop.
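The critique-revision-decide loop can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `generate` callable, the bracketed prompt tags, the A/B decision format, and the `max_iters` default are all assumptions made for the example, and image inputs are omitted for brevity.

```python
from typing import Callable

# Hypothetical interface: one model call that maps a text prompt to a text reply.
# The same callable is reused for every step, mirroring the single-LMM design.
Generate = Callable[[str], str]


def self_revise(question: str, generate: Generate, max_iters: int = 3) -> str:
    """Iteratively critique, revise, and decide, all with one model.

    The prompt templates below are placeholders, not Volcano's real prompts.
    """
    # Step 0: initial response.
    response = generate(f"[answer] Question: {question}")

    for _ in range(max_iters):
        # Step 1: critique the current response.
        feedback = generate(
            f"[critique] Question: {question}\nResponse: {response}"
        )
        # Step 2: revise the response using the feedback.
        revision = generate(
            f"[revise] Question: {question}\nResponse: {response}\n"
            f"Feedback: {feedback}"
        )
        # Step 3: decide whether the revision (B) beats the current response (A).
        decision = generate(
            f"[decide] Question: {question}\nA: {response}\nB: {revision}\n"
            "Reply with A or B."
        )
        if decision.strip().upper().startswith("B"):
            response = revision  # accept the revision and keep iterating
        else:
            break  # current response is preferred, so stop early

    return response
```

Stopping as soon as the model prefers its current response keeps the number of model calls bounded by `1 + 3 * max_iters`.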
Model details
Model type: Volcano-13b is a multimodal self-feedback guided revision model. It was built by fine-tuning the vicuna-13b-v1.5 model on a mixture of the visual instruction tuning dataset used in LLaVA-v1.5 and multimodal feedback and revision data collected through gpt-3.5-turbo.
Model date: Volcano-13b was trained in October 2023.
Training dataset
- 274K multimodal feedback and revision data.
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following data.
- 450K academic-task-oriented VQA data mixture.
- 40K ShareGPT data.
The dataset used to train Volcano, which includes all of the datasets listed above, is publicly available.
Evaluation dataset
A collection of three multimodal hallucination benchmarks (MMHal-Bench, POPE, GAVIE) and two multimodal understanding benchmarks (MM-Vet, MMBench).