UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

1 Department of Computer Science and Technology, Tsinghua University
2 Zhipu AI

* Equal contribution.

Corresponding author: jietang@mail.tsinghua.edu.cn

UI2Code^N is a visual language foundation model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. It unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing.


Abstract

User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2CodeN, a visual language foundation model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2CodeN establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5.

Method Overview

UI2CodeN follows an interactive UI-to-code paradigm that progressively generates, edits, and refines UI code with visual feedback.
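The generate-render-refine loop can be sketched as follows. This is a minimal illustration of the paradigm, not the model's published interface: the function names, signatures, and the fixed round count are all assumptions.

```python
from typing import Callable

def interactive_ui_to_code(screenshot: str,
                           generate: Callable[[str], str],
                           render: Callable[[str], str],
                           polish: Callable[[str, str, str], str],
                           rounds: int = 2) -> str:
    """Sketch of the interactive UI-to-code loop (illustrative signatures)."""
    code = generate(screenshot)                    # initial UI-to-code pass
    for _ in range(rounds):                        # each round is one feedback turn
        rendered = render(code)                    # visual feedback: render candidate
        code = polish(screenshot, rendered, code)  # refine code toward the target UI
    return code
```

More rounds trade extra compute for higher fidelity, which is what makes the paradigm test-time scalable.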


Training Recipe

Although recent VLMs have demonstrated substantial progress on general vision benchmarks, their performance on UI coding remains limited due to two main challenges.

First, the inherent difficulty of UI coding: the model must perceive UI-style images with fine-grained details such as icons, fonts, and layout structures, which differ substantially from the natural images used in conventional pretraining. Moreover, UI code (HTML/CSS/JavaScript) can exceed 10,000 tokens and requires precise image-code alignment at both the global and element level.

Second, the limitations of available training data. Real webpages provide rich diversity but contain noisy HTML tied to external resources, making them unsuitable for direct supervision. In contrast, synthetic or pruned datasets are clean but lack real-world complexity. As a result, prior models mainly rely on synthetic data, leaving large-scale real website data underutilized and limiting performance in practical applications.

We design a three-stage training pipeline:

  • Stage 1 — Continual Pre-training: Train on large-scale real webpage image–HTML pairs to build broad UI coding foundations.
  • Stage 2 — Supervised Fine-tuning: Use curated clean datasets to enhance core capabilities: UI-to-code generation, UI editing, and UI polishing.
  • Stage 3 — Reinforcement Learning: Adapt the model to complex real-world webpages using a verifier-based reward without paired ground-truth HTML.
Three-Stage Training Pipeline
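The Stage 3 reward can be illustrated with a toy sketch: since the verifier-based reward needs no paired ground-truth HTML, it can check that a candidate renders at all and then score visual similarity between the rendered screenshot and the input image. The function names and the pixel-level similarity metric below are illustrative assumptions; the paper's actual verifier is more sophisticated.

```python
def renders_ok(html: str) -> bool:
    # Stand-in for a real headless-browser render check.
    low = html.lower()
    return "<html" in low and "</html>" in low

def pixel_similarity(a: list, b: list) -> float:
    # Mean absolute difference over equal-length grayscale pixel lists,
    # mapped to [0, 1] where 1.0 means identical images.
    diff = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 - diff / (255.0 * len(a))

def verifier_reward(html: str, rendered: list, target: list) -> float:
    # No paired ground-truth HTML is needed: the candidate's rendered
    # screenshot is compared directly against the input UI image.
    if not renders_ok(html):
        return 0.0
    return pixel_similarity(rendered, target)
```

A reward of this shape lets reinforcement learning exploit unpaired real webpages: only a screenshot of the target is required.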

    Introduction

    We propose a novel interactive UI-to-code paradigm that fundamentally departs from prior single-turn generation approaches, redefining UI-to-code as an iterative, interactive process of generation, editing, and polishing. This paradigm offers flexible usage, improves performance, and enables test-time scaling in UI-to-code generation.

    Guided by this paradigm, we present UI2CodeN, a powerful visual language model trained via a three-stage training pipeline: large-scale pretraining on noisy real-world data to build broad multimodal foundations, supervised fine-tuning on synthetic datasets to improve code quality, and reinforcement learning with a carefully designed verifier to exploit unpaired real webpages while maintaining generation fidelity.

    Experimental results demonstrate that UI2CodeN achieves state-of-the-art performance in UI coding. Building upon the core task of UI-to-code, UI2CodeN further extends its capabilities to UI polishing and UI editing.

    Demo Cases

    1. UI-to-Code Generation Demo

    Generate clean, executable HTML/CSS/JavaScript code directly from UI screenshots with high fidelity.

    Reference Screenshot
    Reference UI
    Generated HTML Code
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Amazon Homepage</title>
        <style>
            * {
                margin: 0;
                padding: 0;
                box-sizing: border-box;
            }
            
            body {
                font-family: Arial, sans-serif;
                background: #e3e6e6;
            }
            
            .header {
                background: #131921;
                color: white;
                padding: 10px 20px;
                display: flex;
                align-items: center;
                gap: 20px;
            }
            
            .search-bar {
                flex: 1;
                display: flex;
                max-width: 800px;
            }
            
            .hero-banner {
                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                padding: 40px;
                text-align: center;
                color: white;
            }
            
            .product-grid {
                display: grid;
                grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
                gap: 20px;
                padding: 20px;
            }
        </style>
    </head>
    <body>
        <!-- Header content -->
        <div class="header">
            <div class="logo">Amazon</div>
            <div class="search-bar">
                <input type="text" placeholder="Search Amazon">
            </div>
        </div>
        
        <!-- Hero Banner -->
        <div class="hero-banner">
            <h1>Prime Big Deal Days</h1>
            <p>Members unlock early deals</p>
        </div>
        
        <!-- Product Grid -->
        <div class="product-grid">
            <!-- Product cards here -->
        </div>
    </body>
    </html>
    Rendered Result
    Rendered UI

    2. UI Editing Demo

    Modify existing UI code based on natural language instructions and visual context.

    Reference Screenshot
    Reference UI
    💬 Prompt:
    "change the background color of the original UI to purple."
    Generated HTML Code
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Amazon Homepage</title>
        <style>
            * {
                margin: 0;
                padding: 0;
                box-sizing: border-box;
            }
            
            body {
                font-family: Arial, sans-serif;
                background: #6a0dad; /* purple background, per the edit prompt */
            }
            
            .header {
                background: #131921;
                color: white;
                padding: 10px 20px;
                display: flex;
                align-items: center;
                gap: 20px;
            }
            
            .search-bar {
                flex: 1;
                display: flex;
                max-width: 800px;
            }
            
            .hero-banner {
                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                padding: 40px;
                text-align: center;
                color: white;
            }
            
            .product-grid {
                display: grid;
                grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
                gap: 20px;
                padding: 20px;
            }
        </style>
    </head>
    <body>
        <!-- Header content -->
        <div class="header">
            <div class="logo">Amazon</div>
            <div class="search-bar">
                <input type="text" placeholder="Search Amazon">
            </div>
        </div>
        
        <!-- Hero Banner -->
        <div class="hero-banner">
            <h1>Prime Big Deal Days</h1>
            <p>Members unlock early deals</p>
        </div>
        
        <!-- Product Grid -->
        <div class="product-grid">
            <!-- Product cards here -->
        </div>
    </body>
    </html>
    Rendered Result
    Rendered UI

    3. UI Polishing Demo

    Iteratively refine UI layout, spacing, typography, and aesthetics to match design style.

    Reference Screenshot
    Reference UI
    First Generated HTML Code
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Basic Layout</title>
        <style>
            body {
                font-family: Arial;
                margin: 20px;
            }
            
            .header {
                background: #333;
                color: white;
                padding: 20px;
            }
            
            .content {
                margin-top: 20px;
            }
        </style>
    </head>
    <body>
        <div class="header">
            <h1>Header</h1>
        </div>
        <div class="content">
            <p>Content here</p>
        </div>
    </body>
    </html>
    First Rendered Result
    First Rendered UI
    ⬇️

    Iterative Polishing

    💬 Polishing Prompt:
    "Please compare the design draft and the rendered screenshot in detail, pointing out the differences in layout, color scheme, typography, fonts, spacing, component styles, and interactive effects. Based on this, modify the HTML and CSS to output a complete and runnable HTML file, making the rendered result closer to the target UI."
    Polished HTML Code
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Polished Layout</title>
        <style>
            * {
                margin: 0;
                padding: 0;
                box-sizing: border-box;
            }
            
            body {
                font-family: 'Segoe UI', system-ui, sans-serif;
                background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
                padding: 40px;
            }
            
            .header {
                background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                color: white;
                padding: 30px 40px;
                border-radius: 12px;
                box-shadow: 0 10px 30px rgba(0,0,0,0.2);
            }
            
            .content {
                margin-top: 30px;
                padding: 30px;
                background: white;
                border-radius: 12px;
                box-shadow: 0 5px 15px rgba(0,0,0,0.1);
            }
        </style>
    </head>
    <body>
        <div class="header">
            <h1 style="font-size: 32px; font-weight: 600;">Modern Header</h1>
        </div>
        <div class="content">
            <p style="line-height: 1.8; color: #333;">Beautifully styled content</p>
        </div>
    </body>
    </html>
    Polished Result
    Polished UI
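    A polishing turn pairs the design draft, the current rendered screenshot, and the current code with a prompt like the one above. A generic chat-style payload might be assembled as follows; the message schema is an illustrative assumption, not UI2CodeN's actual API.

```python
def build_polish_turn(prompt: str, draft_png: bytes,
                      rendered_png: bytes, html: str) -> list:
    # Hypothetical request builder for one polishing turn: both images plus
    # the current code go into a single multimodal user message.
    return [{
        "role": "user",
        "content": [
            {"type": "image", "label": "design draft", "data": draft_png},
            {"type": "image", "label": "rendered screenshot", "data": rendered_png},
            {"type": "text", "text": f"{prompt}\n\nCurrent HTML:\n{html}"},
        ],
    }]
```

    Repeating this turn with each new rendered screenshot is what drives the iterative polishing loop.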

    Experimental Results

    We evaluate UI2CodeN on two major tasks: (1) UI-to-Code generation and (2) UI Polishing. Results demonstrate that UI2CodeN consistently achieves state-of-the-art performance among all open-source models and is competitive with leading closed-source VLMs.

    UI2CodeN outperforms GPT-4o, Claude 3.7/4 Sonnet, and Gemini-2.5 on most UI coding benchmarks and approaches GPT-5 in overall performance. Our RL-tuned version, UI2CodeN-9B-RL, achieves the best overall open-source performance and even surpasses some commercial models.

    Design2Code, Flame, Web2Code, and UI2Code-Real report UI-to-Code results; UIPolish-Real and UIPolish-Synthetic report UI Polishing results.

    | Model | Design2Code | Flame | Web2Code | UI2Code-Real | UIPolish-Real | UIPolish-Synthetic |
    | --- | --- | --- | --- | --- | --- | --- |
    | *Open-source VLM* | | | | | | |
    | InternVL3-9B | 15.3 | 11.3 | 12.3 | 16.5 | 4.0 | 7.0 |
    | InternVL3-78B | 30.0 | 51.3 | 45.5 | 30.4 | 10.0 | 15.0 |
    | Qwen2.5-VL-7B | 29.1 | 25.0 | 37.2 | 26.1 | 11.0 | 14.0 |
    | Qwen2.5-VL-72B | 41.9 | 46.3 | 64.1 | 40.9 | 23.0 | 38.0 |
    | MiMo-VL-7B-SFT | 28.3 | 10.0 | 44.3 | 33.9 | 17.0 | 33.0 |
    | MiMo-VL-7B-RL | 28.7 | 8.8 | 38.3 | 30.4 | 16.0 | 30.0 |
    | Kimi-VL-A3B-Instruct | 27.3 | 50.0 | 69.1 | 26.1 | 14.0 | 40.0 |
    | Kimi-VL-A3B-Thinking | 38.8 | 36.3 | 46.6 | 27.0 | 14.0 | 27.0 |
    | GLM-4.1V-9B-Thinking | 64.7 | 72.5 | 71.3 | 53.0 | 42.0 | 46.0 |
    | *Closed-source VLM* | | | | | | |
    | Claude-4-Sonnet-thinking | 81.2 | 76.3 | 85.1 | 63.5 | 78.0 | 65.0 |
    | Claude-3.7-Sonnet-thinking | 77.7 | 80.0 | 73.3 | 55.8 | 75.0 | 62.0 |
    | GPT-5 | 89.7 | 91.3 | 93.7 | 67.8 | 85.0 | 68.0 |
    | GPT-4o | 35.3 | 75.0 | 62.7 | 21.7 | 26.0 | 14.0 |
    | o4-mini | 63.8 | 83.8 | 77.9 | 59.1 | 65.0 | 65.0 |
    | Gemini-2.5-Pro | 89.5 | 87.5 | 90.6 | 68.7 | 74.0 | 68.0 |
    | Gemini-2.5-Flash | 70.5 | 72.5 | 85.7 | 62.6 | 17.0 | 24.0 |
    | Doubao-1.5-thinking-vision | 53.7 | 78.8 | 55.6 | 38.3 | 51.0 | 61.0 |
    | Doubao-1.6-thinking-250715 | 62.4 | 67.7 | 67.2 | 43.4 | 61.0 | 67.0 |
    | *UI2CodeN* | | | | | | |
    | UI2CodeN-9B-SFT | 79.3 | 85.0 | 80.8 | 67.0 | 76.0 | 89.0 |
    | UI2CodeN-9B-RL | 88.6 | 95.0 | 92.5 | 76.5 | 80.0 | 94.0 |

    Citation

    @article{ui2coden2025,
        title   = {UI2Code$^{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation},
        author  = {Yang, Zhen and Hong, Wenyi and Xu, Mingde and Fan, Xinyue and Wang, Weihan and Gu, Xiaotao and Tang, Jie},
        journal = {arXiv preprint arXiv:2511.08195},
        year    = {2025}
    }