WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

* Equal contribution.

† Corresponding authors: yang-zhen@mail.tsinghua.edu.cn, jietang@mail.tsinghua.edu.cn

WebVIA is the first agentic framework for interactive and verifiable UI-to-Code generation. While prior vision-language models only produce static HTML/CSS layouts, WebVIA enables executable and interactive web interfaces.

The framework consists of three modules:

WebVIA-Agent – navigates websites and captures multi-state UI screenshots.
WebVIA-UI2Code – generates functional HTML/CSS/JavaScript code with interactivity.
Validation Module – verifies whether the generated UI behaves as expected.

WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro), and WebVIA-UI2Code improves executable, interactive code generation over its base models on the Design2Code and UIFlow2Code benchmarks.

Abstract

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision–Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks.

Agentic Framework

WebVIA is the first agentic framework that enables end-to-end interactive UI-to-code generation, from multi-state UI exploration to executable code synthesis and automatic validation.

WebVIA-Agent explores websites, identifies UI states, and captures key screenshots and DOM states.

WebVIA-UI2Code transforms these UI states into functional HTML/CSS/JavaScript code.

Validation Module executes the generated code and checks whether the UI behavior aligns with expected interactivity.
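The validation step can be pictured as replaying expected interactions against the generated interface. The following is a minimal sketch under stated assumptions, not the framework's actual implementation: the generated UI is modeled as a plain dictionary from state to actions, and the state names, action strings, and the `validate` function are hypothetical stand-ins for a real browser-driven check.

```python
from typing import Dict, List, Optional, Tuple

# (source state, action, expected target state); all names are illustrative.
Transition = Tuple[str, str, str]
Failure = Tuple[str, str, str, Optional[str]]


def validate(
    expected: List[Transition],
    generated_ui: Dict[str, Dict[str, str]],
) -> Tuple[bool, List[Failure]]:
    """Replay each expected transition against a toy model of the generated UI."""
    failures: List[Failure] = []
    for source, action, target in expected:
        # None means the action is not wired up in the generated interface.
        actual = generated_ui.get(source, {}).get(action)
        if actual != target:
            failures.append((source, action, target, actual))
    return len(failures) == 0, failures


# Expected behavior, as an exploration step might have recorded it (assumed data).
expected = [
    ("home", "click #login", "login_form"),
    ("login_form", "click #cancel", "home"),
]

# Toy generated UI in which the cancel button does nothing.
generated = {
    "home": {"click #login": "login_form"},
    "login_form": {},
}

ok, failures = validate(expected, generated)
print(ok)        # False: one interaction is missing
print(failures)  # [('login_form', 'click #cancel', 'home', None)]
```

In this toy model a failure records both the expected and the observed target, which is the kind of information that could be used to decide whether the generated code needs another pass.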

Introduction

Unlike previous UI2Code methods that only produce static HTML/CSS layouts, WebVIA enables executable and interactive web interfaces by integrating three components: (1) WebVIA-Agent, which interacts with the HTML environment to explore webpages and capture multi-state UI screenshots; (2) WebVIA-UI2Code, which generates functional HTML/CSS/JavaScript code based on these screenshots; (3) a validation module that verifies whether the generated interface supports real user interactions.

We train two core models. WebVIA-Agent is trained on a large-scale GUI interaction dataset and demonstrates higher stability and accuracy than general-purpose agents such as Gemini-2.5-Pro. WebVIA-UI2Code, built upon Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base, is fine-tuned on paired multi-state screenshots and executable HTML, achieving gains of +5.2 and +4.7 on Design2Code, respectively, and scores of 75.9 and 84.9 on UIFlow2Code.

Demo Cases

1. Agent Exploration Demo

WebVIA-Agent explores real web pages by interacting with DOM elements, collecting multi-state UI screenshots.
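Multi-state exploration of this kind can be illustrated as a graph traversal in which each distinct UI state is recorded once. The sketch below is an assumption-laden toy, not WebVIA-Agent itself: `TOY_PAGE`, the state names, the action strings, and the `explore` function are hypothetical, and a dictionary lookup stands in for driving a real browser and taking screenshots.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical toy page: each state maps an available action to the resulting state.
TOY_PAGE: Dict[str, Dict[str, str]] = {
    "home": {"click #menu": "menu_open", "click #login": "login_form"},
    "menu_open": {"click #menu": "home", "click #login": "login_form"},
    "login_form": {"click #cancel": "home"},
}


def explore(
    start: str,
    page: Dict[str, Dict[str, str]],
    max_states: int = 50,
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Breadth-first exploration that keeps each distinct state only once."""
    seen: List[str] = [start]  # discovery order; one "screenshot" per entry
    transitions: List[Tuple[str, str, str]] = []  # (source, action, target)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for action, target in page.get(state, {}).items():
            transitions.append((state, action, target))
            # Skip states already captured, and stop adding once the cap is hit.
            if target not in seen and len(seen) < max_states:
                seen.append(target)
                queue.append(target)
    return seen, transitions


states, transitions = explore("home", TOY_PAGE)
print(states)            # ['home', 'menu_open', 'login_form']
print(len(transitions))  # 5
```

Keeping the visited list separate from the transition log means revisiting a known state still records the interaction but does not produce a duplicate state.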

2. UI2Code Generation Demo

Given multi-state UI screenshots, WebVIA-UI2Code generates fully executable HTML/CSS/JavaScript code that supports real interactions such as clicking, text input, and page navigation.

Experimental Results

We evaluate WebVIA on two key tasks: (1) UI exploration and (2) interactive UI-to-Code generation. WebVIA-Agent attains the highest overall exploration score among the compared agents, and both WebVIA-UI2Code models outperform their base models on Design2Code.

1. WebVIA-Agent: UI Exploration Performance

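We evaluate WebVIA-Agent on UIExplore-Bench (56 webpages) and compare it with state-of-the-art VLM agents. WebVIA-Agent achieves the highest overall score (89.63), with the best completeness (93.12) and correctness (97.71). Its deduplication rate (72.73) is lower than that of GPT-4o, GPT-5, and Claude-Sonnet-4.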

Model	Completeness (%)	Correctness (%)	Deduplication Rate (%)	Overall Score (%)
Gemini-2.5-Pro	92.61	95.39	5.60	71.83
GPT-5	76.66	90.19	93.82	85.69
o4-mini	91.73	94.07	52.73	82.80
GPT-4o	16.46	62.63	97.45	52.87
Claude-Sonnet-3.7	75.86	94.06	72.36	81.35
Claude-Sonnet-4	86.26	95.07	80.36	87.87
WebVIA-Agent	93.12	97.71	72.73	89.63
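The per-metric leaders in the table above can be checked with a few lines of code. This is only a reading aid: the numbers are copied from the table, while `ROWS`, `METRICS`, and `best_per_metric` are illustrative names, and no assumption is made about how the overall score is derived from the other columns.

```python
from typing import Dict, List, Tuple

# Rows copied from the table: (model, completeness, correctness, deduplication, overall).
ROWS: List[Tuple[str, float, float, float, float]] = [
    ("Gemini-2.5-Pro", 92.61, 95.39, 5.60, 71.83),
    ("GPT-5", 76.66, 90.19, 93.82, 85.69),
    ("o4-mini", 91.73, 94.07, 52.73, 82.80),
    ("GPT-4o", 16.46, 62.63, 97.45, 52.87),
    ("Claude-Sonnet-3.7", 75.86, 94.06, 72.36, 81.35),
    ("Claude-Sonnet-4", 86.26, 95.07, 80.36, 87.87),
    ("WebVIA-Agent", 93.12, 97.71, 72.73, 89.63),
]
METRICS = ["completeness", "correctness", "deduplication", "overall"]


def best_per_metric(rows: List[Tuple[str, float, float, float, float]]) -> Dict[str, str]:
    """Return the model with the highest value in each metric column."""
    return {
        name: max(rows, key=lambda row: row[i + 1])[0]
        for i, name in enumerate(METRICS)
    }


best = best_per_metric(ROWS)
for metric in METRICS:
    print(f"{metric}: {best[metric]}")
```

Run on these rows, the deduplication column is led by GPT-4o, while the other three columns are led by WebVIA-Agent.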

2. WebVIA-UI2Code: Interactive Code Generation

We fine-tune Qwen2.5-VL-7B-Instruct and GLM-4.1-V-9B-Base to build the WebVIA-UI2Code models. Both improve over their base models on Design2Code (+5.2 and +4.7, respectively) while producing executable interactive HTML/CSS/JavaScript. No UIFlow2Code score is reported for the base models (shown as "/").

Model	Design2Code	UIFlow2Code
Qwen2.5-VL-7B-Instruct	29.1	/
WebVIA-UI2Code-Qwen	34.3	75.9
GLM-4.1-V-9B-Base	58.3	/
WebVIA-UI2Code-GLM	63.0	84.9
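The Design2Code gains over the base models follow directly from the table above. The short sketch below reproduces that arithmetic; the scores are copied from the table, and `SCORES`, `PAIRS`, and `gains` are illustrative names only.

```python
from typing import Dict, List, Tuple

# Design2Code scores copied from the table above.
SCORES: Dict[str, float] = {
    "Qwen2.5-VL-7B-Instruct": 29.1,
    "WebVIA-UI2Code-Qwen": 34.3,
    "GLM-4.1-V-9B-Base": 58.3,
    "WebVIA-UI2Code-GLM": 63.0,
}

# (base model, fine-tuned model) pairs.
PAIRS: List[Tuple[str, str]] = [
    ("Qwen2.5-VL-7B-Instruct", "WebVIA-UI2Code-Qwen"),
    ("GLM-4.1-V-9B-Base", "WebVIA-UI2Code-GLM"),
]


def gains(scores: Dict[str, float], pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Absolute Design2Code gain of each fine-tuned model over its base model."""
    # Round to one decimal to match the table's precision and hide float noise.
    return {tuned: round(scores[tuned] - scores[base], 1) for base, tuned in pairs}


design2code_gains = gains(SCORES, PAIRS)
print(design2code_gains)
# {'WebVIA-UI2Code-Qwen': 5.2, 'WebVIA-UI2Code-GLM': 4.7}
```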

Citation

@article{xu2025webvia,
    title={WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation},
    author={Xu, Mingde and Yang, Zhen and Hong, Wenyi and Pan, Lihang and Fan, Xinyue and Wang, Yan and Gu, Xiaotao and Xu, Bin and Tang, Jie},
    year={2025},
    journal={arXiv preprint arXiv:2511.06251}
}