Abstract
Generative AI has made rapid advances in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. These capabilities can enable a new paradigm of front-end development, in which multimodal LLMs directly convert visual designs into code implementations. In this work, we formalize this as the Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate code implementations that render into the given reference webpages, using only the screenshots as input. We also complement the automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models. Moreover, annotators judge that GPT-4V-generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; perhaps surprisingly, in 64% of cases GPT-4V-generated webpages are considered better than the original reference webpages. Our fine-grained breakdown metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects such as text content and coloring can be drastically improved with proper finetuning.
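To make the task setup concrete, here is a minimal sketch of a Design2Code-style loop: prompt a vision-capable model with a webpage screenshot, render the returned HTML headlessly, and score visual similarity against the reference. The model name, prompt wording, and the pixel-level similarity measure are illustrative assumptions, not the paper's exact prompting methods or evaluation metrics.

```python
# Sketch of a Design2Code-style pipeline: screenshot -> HTML -> re-render -> compare.
# Assumes the OpenAI Python SDK (>= 1.0), Playwright, Pillow, and NumPy are installed.
import base64

import numpy as np
from openai import OpenAI
from PIL import Image
from playwright.sync_api import sync_playwright


def screenshot_to_html(image_path: str) -> str:
    """Ask a multimodal model to reproduce a webpage screenshot as one HTML file."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper evaluates GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write a single self-contained HTML+CSS file that "
                         "reproduces this webpage screenshot as closely as possible."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    # Real usage would also strip any markdown code fences around the HTML.
    return resp.choices[0].message.content


def render_html(html: str, out_path: str) -> None:
    """Render generated HTML in a headless browser and save a screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.set_content(html)
        page.screenshot(path=out_path, full_page=True)
        browser.close()


def pixel_similarity(ref_path: str, gen_path: str) -> float:
    """Crude placeholder metric: 1 minus mean absolute pixel difference."""
    ref = Image.open(ref_path).convert("RGB")
    gen = Image.open(gen_path).convert("RGB").resize(ref.size)
    ref_a = np.asarray(ref, dtype=np.float32)
    gen_a = np.asarray(gen, dtype=np.float32)
    return 1.0 - float(np.abs(ref_a - gen_a).mean() / 255.0)


if __name__ == "__main__":
    html = screenshot_to_html("reference.png")
    render_html(html, "generated.png")
    print(f"similarity: {pixel_similarity('reference.png', 'generated.png'):.3f}")
```

Note that the paper's fine-grained breakdown metrics (visual elements, layout, text content, coloring) are richer than the single pixel-agreement score above, which serves only as a stand-in for the overall compare-rendered-output idea.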
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset (2024)
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (2024)
- DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning (2024)
- Unsupervised Evaluation of Code LLMs with Round-Trip Correctness (2024)
- DevEval: Evaluating Code Generation in Practical Software Projects (2024)
This paper was selected and reviewed at Harmonious as the spotlight paper for the week of March 4, 2024.
Authors: please comment/correct as appropriate.
Models citing this paper: 1
Datasets citing this paper: 5
Spaces citing this paper: 0