Abstract
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. Most existing agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees); they are limited in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational cost by formulating each screenshot as a UI connected graph, adaptively identifying redundant relationships among patches and using them as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation and pairing multi-turn query-action sequences per screenshot to improve training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalance. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces redundant visual tokens by 33% during training and speeds up training by 1.4x. Navigation experiments on web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
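To make the UI-guided token-selection idea in (i) more concrete, below is a minimal sketch rather than the authors' implementation: it groups screenshot patches with near-identical mean RGB into connected components via union-find and keeps only a few token indices per component, which a training loop could then use to skip redundant visual tokens in self-attention. The patch size, color tolerance, and helper names (`patch_colors`, `ui_connected_components`, `select_token_indices`) are illustrative assumptions.

```python
import numpy as np

def patch_colors(img, patch=28):
    """Average RGB per patch; img is an HxWx3 uint8 array."""
    H, W, _ = img.shape
    gh, gw = H // patch, W // patch
    patches = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3)
    return patches.mean(axis=(1, 3)), gh, gw  # (gh, gw, 3) mean color per patch

def ui_connected_components(colors, tol=1.0):
    """Union-find over the patch grid: 4-adjacent patches with (near-)identical
    mean RGB are merged into one component, approximating the UI connected graph."""
    gh, gw, _ = colors.shape
    parent = list(range(gh * gw))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(gh):
        for j in range(gw):
            idx = i * gw + j
            if j + 1 < gw and np.abs(colors[i, j] - colors[i, j + 1]).max() <= tol:
                union(idx, idx + 1)
            if i + 1 < gh and np.abs(colors[i, j] - colors[i + 1, j]).max() <= tol:
                union(idx, idx + gw)
    return np.array([find(k) for k in range(gh * gw)])

def select_token_indices(components, keep_per_component=1, rng=None):
    """Randomly keep a few token indices per component; the remaining tokens are
    treated as redundant and can be skipped during training."""
    rng = rng or np.random.default_rng(0)
    keep = []
    for comp in np.unique(components):
        members = np.flatnonzero(components == comp)
        keep.extend(rng.choice(members, size=min(keep_per_component, len(members)),
                               replace=False))
    return np.sort(np.array(keep))
```

In the paper, the selected indices act inside the self-attention blocks (redundant tokens are randomly skipped during training); the sketch above stops at producing the kept indices from a raw screenshot.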
Community
TLDR: ShowUI is a lightweight vision-language-action model for GUI agents.
Github: https://github.com/showlab/ShowUI/
ArXiv: https://arxiv.org/abs/2411.17465
HF Models: https://huggingface.co/showlab/ShowUI-2B
HF Spaces: https://huggingface.co/spaces/showlab/ShowUI
HF Datasets: https://huggingface.co/datasets/showlab/ShowUI-desktop-8K
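Below is a minimal, assumed usage sketch for the HF model linked above, not the official recipe: it treats showlab/ShowUI-2B as a Qwen2-VL-style checkpoint loadable with Hugging Face transformers and asks it to ground a UI element on a screenshot. The prompt wording, screenshot path, and any coordinate post-processing are placeholders; see the model card and Space for the exact format.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Assumption: showlab/ShowUI-2B loads as a Qwen2-VL-based checkpoint; consult the
# model card for the exact grounding prompt and how outputs map to pixel coordinates.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "showlab/ShowUI-2B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")

image = Image.open("screenshot.png")  # placeholder path to a UI screenshot
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Locate the 'Sign in' button."},  # placeholder query
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```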
Related papers, recommended automatically by the Librarian Bot via the Semantic Scholar API:
- EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data (2024)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (2024)
- Harnessing Webpage UIs for Text-Rich Visual Understanding (2024)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (2024)
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (2024)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks (2024)
- DOGE: Towards Versatile Visual Document Grounding and Referring (2024)