WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

1 Faculty of Mathematics, University of Waterloo
2 The Knowledge Engineering Group (KEG), Tsinghua University
3 Zhipu AI

* Equal contribution.

Corresponding authors: yang-zhen@mail.tsinghua.edu.cn, jietang@mail.tsinghua.edu.cn

WebVIA is the first agentic framework for interactive and verifiable UI-to-Code generation. While prior vision-language models only produce static HTML/CSS layouts, WebVIA enables executable and interactive web interfaces.

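The framework consists of three modules: WebVIA-Agent, which explores the UI and captures multi-state screenshots; WebVIA-UI2Code, which generates executable interactive code; and a validation module, which verifies the interactivity of that code.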

WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro), and WebVIA-UI2Code significantly improves executable and interactive code generation across multiple benchmarks.

WebVIA Framework

Abstract

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision–Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks.

Agentic Framework

WebVIA is the first agentic framework that enables end-to-end interactive UI-to-code generation, from multi-state UI exploration to executable code synthesis and automatic validation.

WebVIA Pipeline

WebVIA-Agent explores websites, identifies UI states, and captures key screenshots and DOM states.

WebVIA-UI2Code transforms these UI states into functional HTML/CSS/JavaScript code.

Validation Module executes the generated code and checks whether the UI behavior aligns with expected interactivity.
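The data flow between the three stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Transition`, `PipelineResult`, `run_pipeline`, and the three stage callables are hypothetical names, and UI states are reduced to plain string labels rather than screenshots or DOM snapshots.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transition:
    """One observed interaction: doing `action` in `source` leads to `target`."""
    source: str
    action: str
    target: str


@dataclass
class PipelineResult:
    code: str
    passed: bool
    failures: List[Transition]


def run_pipeline(
    explore: Callable[[str], List[Transition]],
    generate: Callable[[List[Transition]], str],
    replay: Callable[[str, Transition], str],
    url: str,
) -> PipelineResult:
    """Explore a page, generate code, then replay every recorded transition."""
    transitions = explore(url)          # stage 1: exploration agent (stubbed)
    code = generate(transitions)        # stage 2: UI2Code model (stubbed)
    failures = [                        # stage 3: validation
        t for t in transitions if replay(code, t) != t.target
    ]
    return PipelineResult(code, not failures, failures)


# Usage with stub stages; a real system would plug in a browser and models.
demo = [Transition("closed", "click #menu", "open")]
result = run_pipeline(
    explore=lambda url: demo,
    generate=lambda ts: "<html></html>",
    replay=lambda code, t: t.target,    # pretend the generated page behaves
    url="https://example.com",
)
print(result.passed)
```

Passing the stages in as callables keeps the sketch independent of any particular browser automation or model API.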

Introduction

Unlike previous UI2Code methods that only produce static HTML/CSS layouts, WebVIA enables executable and interactive web interfaces by integrating three components: (1) WebVIA-Agent, which interacts with the HTML environment to explore webpages and capture multi-state UI screenshots; (2) WebVIA-UI2Code, which generates functional HTML/CSS/JavaScript code based on these screenshots; (3) a validation module that verifies whether the generated interface supports real user interactions.

We train two core models. WebVIA-Agent is trained on a large-scale GUI interaction dataset and demonstrates higher stability and accuracy than general-purpose agents such as Gemini-2.5-Pro. WebVIA-UI2Code, built upon Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base, is fine-tuned on paired multi-state screenshots and executable HTML, achieving improvements of +5.2 and +4.7 on Design2Code (29.1 to 34.3 and 58.3 to 63.0, respectively) and scores of 75.9 and 84.9 on UIFlow2Code.

Demo Cases

1. Agent Exploration Demo

WebVIA-Agent explores real web pages by interacting with DOM elements, collecting multi-state UI screenshots.
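One way to picture multi-state exploration is as a graph search in which each distinct UI state is recorded once. The sketch below is an assumption-laden toy, not WebVIA-Agent itself: a state is modeled as the set of visible element ids, `explore` and `toy_click` are hypothetical names, and breadth-first search stands in for whatever policy the trained agent actually follows.

```python
from collections import deque
from typing import Callable, FrozenSet, List, Tuple

# Toy UI state: the set of element ids currently visible. (Assumption: a real
# agent would compare screenshots or DOM snapshots instead.)
State = FrozenSet[str]
ClickFn = Callable[[State, str], State]


def explore(start: State, click: ClickFn, max_states: int = 50):
    """Breadth-first exploration that records each distinct state once."""
    seen = {start}
    order: List[State] = [start]
    edges: List[Tuple[State, str, State]] = []
    queue = deque([start])
    while queue and len(seen) < max_states:
        state = queue.popleft()
        for element in sorted(state):
            nxt = click(state, element)
            if nxt == state:
                continue                # the click changed nothing
            edges.append((state, element, nxt))
            if nxt not in seen:         # skip states already captured
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order, edges


def toy_click(state: State, element: str) -> State:
    """Hypothetical page: a menu button toggles a dropdown panel."""
    if element == "menu_button":
        return state ^ frozenset({"dropdown"})
    return state


states, transitions = explore(frozenset({"menu_button", "logo"}), toy_click)
print(len(states), "states,", len(transitions), "transitions")
```

In this toy page the search finds two states (dropdown closed and open) and the two transitions between them; the `seen` set is what prevents the reopened state from being captured twice.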

2. UI2Code Generation Demo

Given multi-state UI screenshots, WebVIA-UI2Code generates fully executable HTML/CSS/JavaScript code that supports real interactions such as clicking, text input, and page navigation.

Experimental Results

We evaluate WebVIA on two key tasks: (1) UI exploration and (2) interactive UI-to-Code generation. In the results below, WebVIA-Agent attains the highest completeness, correctness, and overall score among the compared agents, and both WebVIA-UI2Code models improve over their base models on Design2Code.

1. WebVIA-Agent: UI Exploration Performance

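We evaluate WebVIA-Agent on UIExplore-Bench (56 webpages) and compare it with state-of-the-art VLM agents. WebVIA-Agent achieves the highest overall score (89.63%), leading on both completeness (93.12%) and correctness (97.71%). Its deduplication rate (72.73%) is lower than that of GPT-4o, GPT-5, and Claude-Sonnet-4.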

| Model | Completeness (%) | Correctness (%) | Deduplication Rate (%) | Overall Score (%) |
|---|---|---|---|---|
| Gemini-2.5-Pro | 92.61 | 95.39 | 5.60 | 71.83 |
| GPT-5 | 76.66 | 90.19 | 93.82 | 85.69 |
| o4-mini | 91.73 | 94.07 | 52.73 | 82.80 |
| GPT-4o | 16.46 | 62.63 | 97.45 | 52.87 |
| Claude-Sonnet-3.7 | 75.86 | 94.06 | 72.36 | 81.35 |
| Claude-Sonnet-4 | 86.26 | 95.07 | 80.36 | 87.87 |
| WebVIA-Agent | 93.12 | 97.71 | 72.73 | 89.63 |
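To read the table column by column, the short script below finds the best-scoring model for each metric. The numbers are copied from the table; the names `RESULTS`, `COLUMNS`, and `leaders` are illustrative only, and the script makes no assumption about how the overall score is derived from the other three columns.

```python
# Rows copied from the UIExplore-Bench table above, all values in percent:
# (completeness, correctness, deduplication rate, overall score).
RESULTS = {
    "Gemini-2.5-Pro": (92.61, 95.39, 5.60, 71.83),
    "GPT-5": (76.66, 90.19, 93.82, 85.69),
    "o4-mini": (91.73, 94.07, 52.73, 82.80),
    "GPT-4o": (16.46, 62.63, 97.45, 52.87),
    "Claude-Sonnet-3.7": (75.86, 94.06, 72.36, 81.35),
    "Claude-Sonnet-4": (86.26, 95.07, 80.36, 87.87),
    "WebVIA-Agent": (93.12, 97.71, 72.73, 89.63),
}
COLUMNS = ("completeness", "correctness", "deduplication", "overall")


def leaders(results):
    """Return the top-scoring model for each metric column."""
    return {
        column: max(results, key=lambda model: results[model][index])
        for index, column in enumerate(COLUMNS)
    }


best = leaders(RESULTS)
for column, model in best.items():
    print(f"{column:>13}: {model}")
```

WebVIA-Agent leads on completeness, correctness, and overall score, while GPT-4o has the highest deduplication rate despite the lowest completeness.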

2. WebVIA-UI2Code: Interactive Code Generation

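We fine-tune Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base to build the WebVIA-UI2Code models. Both improve over their base models on Design2Code (29.1 to 34.3 and 58.3 to 63.0) and score 75.9 and 84.9 on UIFlow2Code while producing executable, interactive HTML/CSS/JavaScript. The table lists no UIFlow2Code score ("/") for the base models.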

| Model | Design2Code | UIFlow2Code |
|---|---|---|
| Qwen2.5-VL-7B-Instruct | 29.1 | / |
| WebVIA-UI2Code-Qwen | 34.3 | 75.9 |
| GLM-4.1V-9B-Base | 58.3 | / |
| WebVIA-UI2Code-GLM | 63.0 | 84.9 |
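The Design2Code gains of +5.2 and +4.7 quoted in the introduction follow directly from the table. The check below uses the table values as given; `DESIGN2CODE` and `gains` are illustrative names.

```python
# Design2Code scores copied from the table: (base model, fine-tuned model).
DESIGN2CODE = {
    "WebVIA-UI2Code-Qwen": (29.1, 34.3),
    "WebVIA-UI2Code-GLM": (58.3, 63.0),
}

# Gain of each fine-tuned model over its base, rounded to one decimal place
# to match the precision of the table.
gains = {
    name: round(finetuned - base, 1)
    for name, (base, finetuned) in DESIGN2CODE.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```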

Citation

@article{xu2025webvia,
    title={WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation},
    author={Xu, Mingde and Yang, Zhen and Hong, Wenyi and Pan, Lihang and Fan, Xinyue and Wang, Yan and Gu, Xiaotao and Xu, Bin and Tang, Jie},
    year={2025},
    journal={arXiv preprint arXiv:2511.06251}
}