MathSE

Improving multimodal mathematical reasoning via self-evolving reflection loops and reward-guided fine-tuning.

1 Beihang University · 2 Tsinghua University · 3 Zhipu AI
*Equal contribution · †Corresponding author
Mathematical Reasoning · Vision-Language Model

Abstract

MathSE unifies distilled supervision, an Outcome Reward Model (ORM), and reflection-driven data refresh to progressively enhance math reasoning in multimodal LLMs.

Multimodal large language models excel at perception-heavy tasks yet still falter on multi-step math reasoning. Prior approaches rely mainly on static distillation from GPT-4-class teachers, which leaves rare failure modes under-covered and the training corpus fixed while the model drifts. MathSE introduces a self-evolving loop: we distill seed reasoning traces, train an ORM to judge full chains-of-thought, and feed ORM critiques back to the model to synthesize refined solutions.

The refreshed traces—both correct originals and ORM-guided reflections—are appended to the supervised fine-tuning pool, forming a continually improving curriculum. Across CogVLM2, Qwen2-VL-7B, and InternVL2.5-8B backbones, MathSE yields consistent improvements on MathVista, MathVerse, MathVision, and MathVL-test, surpassing prior open-source MLLMs on MathVL-test.

Key Highlights

The MathSE loop mirrors “practice → critique → refinement”, allowing models to internalize structured math feedback.

Distilled SFT

Bootstraps each backbone with GPT-4o-distilled chain-of-thought supervision covering algebra, geometry, and diagram reasoning tasks.

Outcome Reward Model

Judges entire reasoning paths, pinpoints the first erroneous step, and produces structured diagnostics that stay faithful to the visual evidence (a minimal interface sketch follows these highlights).

Reflection Refresh

Incorrect traces plus ORM feedback are regenerated into refined solutions that continually expand the supervised dataset.
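
How such a judge might look in code: the sketch below is a hypothetical interface, not the released model. orm_score_step (a callable returning the ORM's confidence that a step is sound given its prefix), the 0.5 decision threshold, and the OrmVerdict fields are all illustrative assumptions.

from dataclasses import dataclass

@dataclass
class OrmVerdict:
    correct: bool          # does the full chain hold up end to end?
    first_error_step: int  # index of the earliest flawed step, -1 if none
    critique: str          # structured diagnostic fed back for reflection

def judge_chain(orm_score_step, steps: list[str]) -> OrmVerdict:
    """Score a chain-of-thought step by step, stopping at the first failure.

    orm_score_step(prefix, step) -> float is an assumed callable giving the
    ORM's probability that `step` is sound given the preceding steps.
    """
    for i, step in enumerate(steps):
        if orm_score_step(steps[:i], step) < 0.5:  # assumed decision threshold
            return OrmVerdict(
                correct=False,
                first_error_step=i,
                critique=f"Step {i} is the first flaw; re-check: {step[:60]}",
            )
    return OrmVerdict(correct=True, first_error_step=-1, critique="")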

Motivation & Findings

MathSE targets the accuracy gap between perception-heavy multimodal reasoning and rigorous math verification.

Why self-evolving?

Static GPT-4o distillation underfits rare error cases and diagram-intensive prompts. MathSE pairs ORM feedback with revised reasoning traces so the training corpus grows alongside model capability.

Headline results

  • +2.1-point average gain on MathVL-test across CogVLM2, Qwen2-VL-7B, and InternVL2.5-8B.
  • The ORM pinpoints the first failure step 81% of the time, keeping critiques actionable.
MathSE motivation diagram

High-level positioning of MathSE within multimodal math reasoning.

Self-Evolving Pipeline

Each iteration appends higher-quality traces, letting MathSE outpace static distilled corpora under the same data budget.

Self-evolving pipeline overview

Full self-evolving pipeline used in the main paper.

1 · Inference

Generate draft solutions on curated math + diagram workloads spanning MathVista and MathVL-test splits.

2 · ORM Feedback

Outcome Reward Model flags the earliest incorrect step and explains the visual or symbolic mismatch that caused it.

3 · Reflection SFT

Correct drafts and ORM-guided reflections are merged into the SFT pool and used to fine-tune the base model before the loop repeats (a schematic of one iteration follows).
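
A schematic of one MathSE-style iteration, under stated assumptions: model.generate, model.finetune, and the draft.steps attribute are hypothetical stand-ins for the real inference and SFT infrastructure, and judge_chain is the ORM sketch from the highlights above.

def self_evolve(model, orm_score_step, problems, sft_pool, rounds=4):
    """Run a few rounds of infer → judge → reflect → fine-tune."""
    for _ in range(rounds):
        fresh = []
        for problem in problems:
            draft = model.generate(problem)                     # 1 · Inference
            verdict = judge_chain(orm_score_step, draft.steps)  # 2 · ORM feedback
            if verdict.correct:
                fresh.append((problem, draft))                  # keep good traces
            else:
                # Regenerate with the ORM critique in context (reflection).
                revised = model.generate(problem, feedback=verdict.critique)
                if judge_chain(orm_score_step, revised.steps).correct:
                    fresh.append((problem, revised))
        sft_pool.extend(fresh)                                  # 3 · Reflection SFT
        model = model.finetune(sft_pool)
    return model, sft_pool

Only reflections that the ORM subsequently accepts enter the pool in this sketch; the paper's exact filtering rule may differ.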

Error Diagnostics

ORM critiques surface distributional weaknesses (visual perception, symbolic slips) so we can target corrective reflections.

Transition tracking

The transition log shows the correct→correct path climbing from 402 cases at Base→Stage1 to 1,018 by Stage3→Stage4, indicating that MathSE steadily locks in reliable solutions. Meanwhile, incorrect→incorrect drops from 678 to 477, and incorrect→correct falls from 705 to 276 as reflections “harvest” the easy fixes early. The residual correct→incorrect curve stays low (≈215–307), suggesting that ORM guidance rarely damages working solutions. A sketch of how these transitions can be tabulated follows the figure below.

Error transitions across reflection rounds

Flow of errors between categories once ORM feedback is applied.
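
For concreteness, the transition counts above could be tabulated as below, assuming per-stage boolean correctness labels for each problem (the data layout and function name are illustrative, not the paper's logging format):

from collections import Counter

def transition_counts(labels_by_stage):
    """labels_by_stage[s][i]: whether problem i is solved at stage s.

    Returns one Counter per adjacent stage pair, keyed by pairs such as
    ('incorrect', 'correct').
    """
    tag = lambda ok: "correct" if ok else "incorrect"
    return [
        Counter((tag(b), tag(a)) for b, a in zip(before, after))
        for before, after in zip(labels_by_stage, labels_by_stage[1:])
    ]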

Appendix Error Taxonomies

Supplemental figures show how ORM critiques break down perception, knowledge, calculation, and reasoning failures.

Visual error taxonomy

Appendix taxonomy for perception-related mistakes.

Knowledge error grid

Examples of factual slips detected by ORM.

Calculation error grid

Numeric mistakes before/after reflection.

Reasoning error grid

Multi-step reasoning pitfalls summarized in the appendix.

Resources

Paper Package

Coming Soon

  • Training code & evaluation scripts
  • Outcome Reward Model checkpoints
  • Reflection dataset + prompts

Citation

@article{Chen2025MathSE,
  title   = {{MathSE}: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning},
  author  = {Chen, Jinhao and Yang, Zhen and Shi, Jianxin and Wo, Tianyu and Tang, Jie},
  journal = {arXiv preprint arXiv:2511.06805},
  year    = {2025}
}