* Equal contribution.
† Corresponding author: jietang@mail.tsinghua.edu.cn
UI2CodeN is a visual language foundation model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. It unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing.
User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2CodeN, a visual language foundation model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2CodeN establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5.
UI2CodeN follows an interactive UI-to-code paradigm that progressively generates, edits, and refines UI code with visual feedback.
Although recent VLMs have demonstrated substantial progress on general vision benchmarks, their performance on UI coding remains limited due to two main challenges.
First, the inherent difficulty of UI coding: the model must perceive UI-style images with fine-grained details such as icons, fonts, and layout structures, which differ markedly from the natural images used in conventional pretraining. Moreover, UI code (HTML/CSS/JavaScript) can exceed 10,000 tokens and requires precise image-code alignment at both the global and element levels.
Second, the limitations of available training data. Real webpages provide rich diversity but contain noisy HTML tied to external resources, making them unsuitable for direct supervision. In contrast, synthetic or pruned datasets are clean but lack real-world complexity. As a result, prior models mainly rely on synthetic data, leaving large-scale real website data underutilized and limiting performance in practical applications.
To address these challenges, we make the following contributions:
We propose a novel interactive UI-to-code paradigm that fundamentally departs from prior single-turn generation approaches, redefining UI-to-code as an iterative, interactive process of generation, editing, and polishing (sketched below). This paradigm supports flexible usage, improves performance, and enables test-time scaling in UI-to-code generation.
Guided by this paradigm, we present UI2CodeN, a powerful visual language model trained via a three-stage training pipeline: large-scale pretraining on noisy real-world data to build broad multimodal foundations, supervised fine-tuning on synthetic datasets to improve code quality, and reinforcement learning with a carefully designed verifier to exploit unpaired real webpages while maintaining generation fidelity.
Experimental results demonstrate that UI2CodeN achieves state-of-the-art performance in UI coding. Building upon the core task of UI-to-code, UI2CodeN further extends its capabilities to UI polishing and UI editing.
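To make the paradigm concrete, here is a minimal sketch of the interactive loop. It is illustrative only, not a released API: `generateCode`, `renderScreenshot`, `visualFeedback`, and `reviseCode` are hypothetical helpers standing in for the model's generation call, a headless-browser render, the visual comparison signal, and the editing/polishing call.

```javascript
// Minimal sketch of the interactive UI-to-code loop. All four helpers are
// hypothetical stand-ins, not part of a released UI2CodeN API.
async function interactiveUIToCode(targetScreenshot, maxTurns = 4) {
  // Turn 1: single-shot UI-to-code generation from the target screenshot.
  let code = await generateCode({ image: targetScreenshot });

  for (let turn = 1; turn < maxTurns; turn++) {
    // Render the candidate code (e.g., in a headless browser) and compare
    // the rendering against the target to obtain visual feedback.
    const rendered = await renderScreenshot(code);
    const feedback = await visualFeedback(targetScreenshot, rendered);
    if (feedback.closeEnough) break;

    // Later turns: polish the code conditioned on both screenshots and the
    // feedback, rather than regenerating from scratch.
    code = await reviseCode({ code, target: targetScreenshot, rendered, feedback });
  }
  return code;
}
```

Under this view, test-time scaling simply means spending a larger interaction budget, for example raising `maxTurns` or sampling several candidates per turn and keeping the one with the best visual feedback.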
UI-to-Code Generation: Generate clean, executable HTML/CSS/JavaScript code directly from UI screenshots with high fidelity.
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Amazon Homepage</title>
  <style>
    * {
      margin: 0;
      padding: 0;
      box-sizing: border-box;
    }
    body {
      font-family: Arial, sans-serif;
      background: #e3e6e6;
    }
    .header {
      background: #131921;
      color: white;
      padding: 10px 20px;
      display: flex;
      align-items: center;
      gap: 20px;
    }
    .search-bar {
      flex: 1;
      display: flex;
      max-width: 800px;
    }
    .hero-banner {
      background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
      padding: 40px;
      text-align: center;
      color: white;
    }
    .product-grid {
      display: grid;
      grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
      gap: 20px;
      padding: 20px;
    }
  </style>
</head>
<body>
  <!-- Header content -->
  <div class="header">
    <div class="logo">Amazon</div>
    <div class="search-bar">
      <input type="text" placeholder="Search Amazon">
    </div>
  </div>
  <!-- Hero Banner -->
  <div class="hero-banner">
    <h1>Prime Big Deal Days</h1>
    <p>Members unlock early deals</p>
  </div>
  <!-- Product Grid -->
  <div class="product-grid">
    <!-- Product cards here -->
  </div>
</body>
</html>
```
UI Editing: Modify existing UI code (for example, the homepage generated above) based on natural language instructions and visual context, as illustrated in the sketch below.
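For illustration only, suppose the instruction is "restyle the hero banner to match the site's dark theme" (a hypothetical instruction; the replacement colors are likewise illustrative). The model returns the same document with only the affected rule rewritten:

```html
<!-- Hypothetical edit: only the .hero-banner rule changes; the rest of the
     document generated above is left untouched. -->
<style>
  .hero-banner {
    background: linear-gradient(135deg, #232f3e 0%, #131921 100%); /* was the purple gradient */
    padding: 40px;
    text-align: center;
    color: white;
  }
</style>
```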
UI Polishing: Iteratively refine UI layout, spacing, typography, and aesthetics to match the design style. Starting from a basic layout:
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Basic Layout</title>
  <style>
    body {
      font-family: Arial;
      margin: 20px;
    }
    .header {
      background: #333;
      color: white;
      padding: 20px;
    }
    .content {
      margin-top: 20px;
    }
  </style>
</head>
<body>
  <div class="header">
    <h1>Header</h1>
  </div>
  <div class="content">
    <p>Content here</p>
  </div>
</body>
</html>
```
After iterative polishing:
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Polished Layout</title>
  <style>
    * {
      margin: 0;
      padding: 0;
      box-sizing: border-box;
    }
    body {
      font-family: 'Segoe UI', system-ui, sans-serif;
      background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
      padding: 40px;
    }
    .header {
      background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
      color: white;
      padding: 30px 40px;
      border-radius: 12px;
      box-shadow: 0 10px 30px rgba(0,0,0,0.2);
    }
    .content {
      margin-top: 30px;
      padding: 30px;
      background: white;
      border-radius: 12px;
      box-shadow: 0 5px 15px rgba(0,0,0,0.1);
    }
  </style>
</head>
<body>
  <div class="header">
    <h1 style="font-size: 32px; font-weight: 600;">Modern Header</h1>
  </div>
  <div class="content">
    <p style="line-height: 1.8; color: #333;">Beautifully styled content</p>
  </div>
</body>
</html>
```
We evaluate UI2CodeN on two major tasks: (1) UI-to-Code generation and (2) UI Polishing. Results demonstrate that UI2CodeN consistently achieves state-of-the-art performance among all open-source models and is competitive with leading closed-source VLMs.
UI2CodeN outperforms GPT-4o, Claude 3.7/4 Sonnet, and Gemini-2.5 on most UI coding benchmarks and approaches GPT-5 in overall performance. Our RL-tuned version, UI2CodeN-9B-RL, achieves the best overall open-source performance and even surpasses several commercial models.
UI-to-Code benchmarks: Design2Code, Flame, Web2Code, UI2Code-Real. UI Polishing benchmarks: UIPolish-Real, UIPolish-Synthetic.

| Model | Design2Code | Flame | Web2Code | UI2Code-Real | UIPolish-Real | UIPolish-Synthetic |
|---|---|---|---|---|---|---|
| *Open-source VLM* | | | | | | |
| InternVL3-9B | 15.3 | 11.3 | 12.3 | 16.5 | 4.0 | 7.0 |
| InternVL3-78B | 30.0 | 51.3 | 45.5 | 30.4 | 10.0 | 15.0 |
| Qwen2.5-VL-7B | 29.1 | 25.0 | 37.2 | 26.1 | 11.0 | 14.0 |
| Qwen2.5-VL-72B | 41.9 | 46.3 | 64.1 | 40.9 | 23.0 | 38.0 |
| MiMo-VL-7B-SFT | 28.3 | 10.0 | 44.3 | 33.9 | 17.0 | 33.0 |
| MiMo-VL-7B-RL | 28.7 | 8.8 | 38.3 | 30.4 | 16.0 | 30.0 |
| Kimi-VL-A3B-Instruct | 27.3 | 50.0 | 69.1 | 26.1 | 14.0 | 40.0 |
| Kimi-VL-A3B-Thinking | 38.8 | 36.3 | 46.6 | 27.0 | 14.0 | 27.0 |
| GLM-4.1V-9B-Thinking | 64.7 | 72.5 | 71.3 | 53.0 | 42.0 | 46.0 |
| *Closed-source VLM* | | | | | | |
| Claude-4-Sonnet-thinking | 81.2 | 76.3 | 85.1 | 63.5 | 78.0 | 65.0 |
| Claude-3.7-Sonnet-thinking | 77.7 | 80.0 | 73.3 | 55.8 | 75.0 | 62.0 |
| GPT-5 | 89.7 | 91.3 | 93.7 | 67.8 | 85.0 | 68.0 |
| GPT-4o | 35.3 | 75.0 | 62.7 | 21.7 | 26.0 | 14.0 |
| o4-mini | 63.8 | 83.8 | 77.9 | 59.1 | 65.0 | 65.0 |
| Gemini-2.5-Pro | 89.5 | 87.5 | 90.6 | 68.7 | 74.0 | 68.0 |
| Gemini-2.5-Flash | 70.5 | 72.5 | 85.7 | 62.6 | 17.0 | 24.0 |
| Doubao-1.5-thinking-vision | 53.7 | 78.8 | 55.6 | 38.3 | 51.0 | 61.0 |
| Doubao-1.6-thinking-250715 | 62.4 | 67.7 | 67.2 | 43.4 | 61.0 | 67.0 |
| *UI2CodeN (ours)* | | | | | | |
| UI2CodeN-9B-SFT | 79.3 | 85.0 | 80.8 | 67.0 | 76.0 | 89.0 |
| UI2CodeN-9B-RL | 88.6 | 95.0 | 92.5 | 76.5 | 80.0 | 94.0 |
```bibtex
@article{ui2coden2025,
  title   = {UI2Code$^{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation},
  author  = {Yang, Zhen and Hong, Wenyi and Xu, Mingde and Fan, Xinyue and Wang, Weihan and Gu, Xiaotao and Tang, Jie},
  journal = {arXiv preprint arXiv:2511.08195},
  year    = {2025}
}
```