Harnessing Webpage UIs for Text-Rich Visual Understanding
Abstract
Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs can process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset of 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel at web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across diverse scenarios.
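To make the synthesis pipeline concrete, below is a minimal sketch of the described approach, assuming Playwright for rendering and the OpenAI Python client as the text-based LLM. The model name, prompt wording, and output format here are illustrative assumptions, not the authors' exact recipe.

```python
import json
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the actual pipeline covers a wider task taxonomy
# (captioning, element grounding, QA, action prediction, etc.).
PROMPT = (
    "Below is the accessibility tree of a webpage. Write 3 question-answer "
    "pairs that require reading the page content. Return a JSON list of "
    "objects with 'question' and 'answer' keys.\n\n"
)

def synthesize_samples(url: str) -> list[dict]:
    """Render a page, then pair LLM-synthesized instructions with its screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        screenshot = page.screenshot(full_page=True)  # visual side of each sample
        tree = page.accessibility.snapshot()          # structured text the LLM reads
        browser.close()

    # The text-only LLM never sees pixels: it works from the tree alone.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable text LLM works
        messages=[{"role": "user",
                   "content": PROMPT + json.dumps(tree)[:20000]}],
    )
    # Assumes the model returns clean JSON; a real pipeline would validate/retry.
    qa_pairs = json.loads(resp.choices[0].message.content)
    # Pair each synthesized instruction with the screenshot for multimodal training.
    return [{"image": screenshot, **qa} for qa in qa_pairs]
```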
Community
MultiUI: a 7.3M-sample multimodal instruction-tuning dataset constructed from 1M webpage UIs for text-rich visual understanding scenarios, including GUI understanding/web agents, text recognition, and document understanding.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (2024)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks (2024)
- Building and better understanding vision-language models: insights and future directions (2024)
- WebQuest: A Benchmark for Multimodal QA on Web Page Sequences (2024)
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models (2024)