* Equal contribution.
† Corresponding authors: yang-zhen@mail.tsinghua.edu.cn, jietang@mail.tsinghua.edu.cn
WebVIA is the first agentic framework for interactive and verifiable UI-to-Code generation. While prior vision-language models only produce static HTML/CSS layouts, WebVIA enables executable and interactive web interfaces.
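The framework consists of three modules: WebVIA-Agent for UI exploration, WebVIA-UI2Code for interactive code generation, and a validation module that checks interactivity.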
WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro), and WebVIA-UI2Code significantly improves executable and interactive code generation across multiple benchmarks.
User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision–Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks.
WebVIA is the first agentic framework that enables end-to-end interactive UI-to-code generation, from multi-state UI exploration to executable code synthesis and automatic validation.
WebVIA-Agent explores websites, identifies UI states, and captures key screenshots and DOM states.
WebVIA-UI2Code transforms these UI states into functional HTML/CSS/JavaScript code.
The validation module executes the generated code and checks whether the UI behavior matches the expected interactivity.
Unlike previous UI2Code methods that only produce static HTML/CSS layouts, WebVIA enables executable and interactive web interfaces by integrating three components: (1) WebVIA-Agent, which interacts with the HTML environment to explore webpages and capture multi-state UI screenshots; (2) WebVIA-UI2Code, which generates functional HTML/CSS/JavaScript code based on these screenshots; (3) a validation module that verifies whether the generated interface supports real user interactions.
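The three-stage flow described above can be sketched as a small pipeline. This is only an illustrative sketch, not the WebVIA implementation: the names `UIState`, `PipelineResult`, `run_pipeline`, and the three stub stages are hypothetical, and the stubs stand in for the real agent, model, and validator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UIState:
    """One captured UI state (hypothetical structure, for illustration only)."""
    screenshot: str   # identifier of the screenshot for this state
    action: str = ""  # action that led to this state; empty for the initial state


@dataclass
class PipelineResult:
    states: List[UIState]
    code: str
    passed: bool


def run_pipeline(
    url: str,
    explore: Callable[[str], List[UIState]],
    generate: Callable[[List[UIState]], str],
    validate: Callable[[str, List[UIState]], bool],
) -> PipelineResult:
    """Chain exploration -> code generation -> validation."""
    states = explore(url)            # stage 1: collect multi-state screenshots
    code = generate(states)          # stage 2: produce HTML/CSS/JavaScript
    passed = validate(code, states)  # stage 3: check the interactions
    return PipelineResult(states, code, passed)


# Stub stages standing in for the real components (assumptions, not real APIs).
def fake_explore(url: str) -> List[UIState]:
    return [UIState("home.png"), UIState("menu_open.png", "click #menu")]


def fake_generate(states: List[UIState]) -> str:
    notes = "".join(f"<!-- {s.action} -->" for s in states if s.action)
    return f"<html><body>{notes}</body></html>"


def fake_validate(code: str, states: List[UIState]) -> bool:
    # Toy check: every explored action must be represented in the output.
    return all(s.action in code for s in states if s.action)


result = run_pipeline("https://example.com", fake_explore, fake_generate, fake_validate)
print(result.passed)
```

Passing the stages in as callables keeps the sketch independent of any particular agent or model, so each stage can be swapped or tested in isolation.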
We train two core models to ensure strong performance. WebVIA-Agent is trained on a large-scale GUI interaction dataset and demonstrates higher stability and accuracy than general-purpose agents such as Gemini-2.5-Pro. WebVIA-UI2Code, built upon Qwen2.5-VL-7B and GLM-4.1V-9B, is fine-tuned on paired multi-state screenshots and executable HTML. The fine-tuned models gain +5.2 and +4.7 on Design2Code, respectively, and score 75.9 and 84.9 on UIFlow2Code.
WebVIA-Agent explores real web pages by interacting with DOM elements, collecting multi-state UI screenshots.
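To make the idea of multi-state exploration concrete, here is a minimal sketch that walks a toy UI modeled as a state graph and records each state once. It is not the WebVIA-Agent algorithm (which the text describes only at a high level): the breadth-first strategy, the `explore` function, and the `toy_ui` dictionary are all assumptions made for illustration.

```python
from collections import deque


def explore(transitions, start):
    """Breadth-first walk over a toy UI graph.

    transitions maps a state name to {action: next_state}.
    Returns the set of distinct states and the list of tried transitions.
    """
    seen = {start}
    queue = deque([start])
    trace = []
    while queue:
        state = queue.popleft()
        for action, nxt in transitions.get(state, {}).items():
            trace.append((state, action, nxt))
            if nxt not in seen:  # skip states that were already captured
                seen.add(nxt)
                queue.append(nxt)
    return seen, trace


# Hypothetical page with three UI states and five possible actions.
toy_ui = {
    "home": {"click #menu": "menu_open", "click #login": "login_form"},
    "menu_open": {"click #close": "home"},
    "login_form": {"type #user": "login_form", "click #cancel": "home"},
}

states, trace = explore(toy_ui, "home")
print(sorted(states))
```

In a real browser setting, each state would correspond to a screenshot and each action to a DOM interaction; the `seen` set plays the role of deduplicating revisited states.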
Given multi-state UI screenshots, WebVIA-UI2Code generates fully executable HTML/CSS/JavaScript code, enabling real interactions such as clicking, input typing, and page navigation.
We evaluate WebVIA on two key tasks: (1) UI exploration and (2) interactive UI-to-Code generation. Results demonstrate that WebVIA consistently outperforms existing VLM-based approaches in terms of stability, correctness, and interactivity.
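We evaluate WebVIA-Agent on UIExplore-Bench (56 webpages) and compare it with state-of-the-art VLM agents. WebVIA-Agent achieves the highest overall score (89.63%), along with the best completeness (93.12%) and correctness (97.71%); its deduplication rate (72.73%) is lower than that of GPT-4o, GPT-5, and Claude-Sonnet-4.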
| Model | Completeness (%) | Correctness (%) | Deduplication Rate (%) | Overall Score (%) |
|---|---|---|---|---|
| Gemini-2.5-Pro | 92.61 | 95.39 | 5.60 | 71.83 |
| GPT-5 | 76.66 | 90.19 | 93.82 | 85.69 |
| o4-mini | 91.73 | 94.07 | 52.73 | 82.80 |
| GPT-4o | 16.46 | 62.63 | 97.45 | 52.87 |
| Claude-Sonnet-3.7 | 75.86 | 94.06 | 72.36 | 81.35 |
| Claude-Sonnet-4 | 86.26 | 95.07 | 80.36 | 87.87 |
| WebVIA-Agent | 93.12 | 97.71 | 72.73 | 89.63 |
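
As a quick way to read the table, the snippet below finds the top model in each column. The numbers are copied from the table above; the variable names are illustrative, and the snippet makes no assumption about how the overall score is computed.

```python
# (completeness, correctness, deduplication rate, overall score), all in %
rows = {
    "Gemini-2.5-Pro": (92.61, 95.39, 5.60, 71.83),
    "GPT-5": (76.66, 90.19, 93.82, 85.69),
    "o4-mini": (91.73, 94.07, 52.73, 82.80),
    "GPT-4o": (16.46, 62.63, 97.45, 52.87),
    "Claude-Sonnet-3.7": (75.86, 94.06, 72.36, 81.35),
    "Claude-Sonnet-4": (86.26, 95.07, 80.36, 87.87),
    "WebVIA-Agent": (93.12, 97.71, 72.73, 89.63),
}
columns = ("completeness", "correctness", "deduplication", "overall")

# For each column, pick the model with the highest value.
best = {
    col: max(rows, key=lambda model: rows[model][i])
    for i, col in enumerate(columns)
}

for col, model in best.items():
    print(f"{col}: {model}")
```

WebVIA-Agent leads on completeness, correctness, and overall score, while GPT-4o has the highest deduplication rate despite the lowest completeness.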
We fine-tune Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base to build the WebVIA-UI2Code models. Both improve over their base models on Design2Code (+5.2 and +4.7) while producing executable interactive HTML/CSS/JavaScript, scoring 75.9 and 84.9 on UIFlow2Code. No UIFlow2Code score is reported for the base models (marked "/").
| Model | Design2Code | UIFlow2Code |
|---|---|---|
| Qwen2.5-VL-7B-Instruct | 29.1 | / |
| WebVIA-UI2Code-Qwen | 34.3 | 75.9 |
| GLM-4.1V-9B-Base | 58.3 | / |
| WebVIA-UI2Code-GLM | 63.0 | 84.9 |
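
The Design2Code gains quoted in the text follow directly from the table. A small check, with scores copied from the table and illustrative variable names:

```python
# Design2Code scores: (base model, fine-tuned WebVIA-UI2Code model)
design2code = {
    "WebVIA-UI2Code-Qwen": (29.1, 34.3),
    "WebVIA-UI2Code-GLM": (58.3, 63.0),
}

# Absolute gain of each fine-tuned model over its base, rounded to one decimal
# to avoid floating-point noise.
gains = {
    name: round(tuned - base, 1)
    for name, (base, tuned) in design2code.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```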
@article{xu2025webvia,
title={WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation},
author={Xu, Mingde and Yang, Zhen and Hong, Wenyi and Pan, Lihang and Fan, Xinyue and Wang, Yan and Gu, Xiaotao and Xu, Bin and Tang, Jie},
year={2025},
journal={arXiv preprint arXiv:2511.06251}
}