HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Abstract
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements to image reasoning tasks, as demonstrated by the recently released GPT-4V(ision), LLaVA-1.5, and others. However, the strong language prior in these SOTA VLMs can be a double-edged sword: the models may ignore the image context and rely solely on a (possibly contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM errors, i.e., language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which yields novel insights into the illusions and hallucinations of VLMs and how to improve them in the future. The benchmark and codebase will be released at https://github.com/tianyi-lab/HallusionBench.
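The two failure modes named in the abstract suggest a simple diagnostic: if a model gives the same answer on an original image and on an edited counterpart whose correct answer flips, it is leaning on its language prior (language hallucination); if it fails even on the unedited image, the visual representation itself is at fault (visual illusion). Below is a minimal sketch of such a probe. The `vlm_answer(image, question)` interface, the image filenames, and the original/edited pairing are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch (not the authors' released code): separating language
# hallucination from visual illusion on a HallusionBench-style yes/no
# question pair. `vlm_answer` is a hypothetical stand-in for any VLM API
# (e.g., a GPT-4V or LLaVA-1.5 wrapper returning a short answer string).

from typing import Callable

def classify_failure(
    vlm_answer: Callable[[str, str], str],
    question: str,
    original_image: str,
    edited_image: str,
    gt_original: str,
    gt_edited: str,
) -> str:
    """Query the model on an original image and a manually edited
    counterpart whose ground-truth answer flips, then label the error."""
    ans_orig = vlm_answer(original_image, question).strip().lower()
    ans_edit = vlm_answer(edited_image, question).strip().lower()

    if ans_orig != gt_original:
        # Wrong even on the unedited image: the vision module is at fault.
        return "visual illusion"
    if ans_edit == gt_edited:
        return "correct on both"
    if ans_edit == ans_orig:
        # Answer unchanged despite contradictory visual evidence: the
        # model ignored the image and reasoned from its language prior.
        return "language hallucination"
    return "other error"

# Toy usage with a stub model that always answers from its prior.
if __name__ == "__main__":
    stub = lambda image, q: "yes"  # hypothetical prior-driven model
    print(classify_failure(
        stub,
        question="Are the two horizontal lines the same length?",
        original_image="illusion.png",
        edited_image="illusion_edited.png",
        gt_original="yes",
        gt_edited="no",
    ))  # prints "language hallucination"
```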
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning (2023)
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts (2023)
- TouchStone: Evaluating Vision-Language Models by Language Models (2023)
- ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models (2023)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models (2023)
- Large Language Models are Temporal and Causal Reasoners for Video Question Answering (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face, check out this Space
Models citing this paper: 0
Datasets citing this paper: 2
Spaces citing this paper: 0