Hello, I'm
RL-Infra & RL Algorithms · Agentic-RL
MiroMind
I currently work at MiroMind on RL-Infra and RL algorithm development, focusing on Agentic-RL. I contributed the fully-async scheduling architecture to OpenRLHF, an open-source, Ray-based, high-performance RLHF framework widely adopted by the community. My research spans LLM reasoning, MCTS, scaling laws for agentic reasoning, and efficient reasoning compression.
RL-Infra & RL Algorithms (Agentic-RL)
Math Reasoning & Agentic-RL
AIGC Safety
Content Safety, Public Opinion Analysis, Personal Privacy, LLM Safety
Department of Electronic Engineering
Bachelor's Degree
Top-venue publications accepted in 2025–2026
Discovering scaling laws for agentic reinforcement learning, in which LLMs spontaneously learn to execute code for mathematical reasoning.
Rethinking how to scale agent interactions for long-horizon research tasks.
A comprehensive gym environment for training and evaluating generalist large language models.
A novel self-play framework that advances agent capabilities through search without external supervision.
A unified dynamic preference optimization framework that adaptively reweights samples by jointly considering data quality and the model's evolving performance during training.
A comprehensive survey of reasoning LLMs, tracing the evolution from fast intuitive (System 1) to slow deliberate (System 2) reasoning paradigms.
A Ray-based open-source framework enabling scalable RLHF training, widely adopted by the research community.
Showing that LLMs can bootstrap their inductive reasoning capability through deductive self-teaching.
An efficient method to compress long chain-of-thought reasoning while preserving accuracy.
Full list on Google Scholar
Technical writing on LLM reasoning, MCTS, and reinforcement learning
Research interests: Probabilistic Graphical Models, Reinforcement Learning, NLP
System 1 vs System 2 thinking paradigm with MCTS for LLM reasoning
Insights on reproducing OpenAI o1 with long-CoT distillation and DPO
Reward shaping, domain randomization, and what zero-RL really optimizes
Deep dive into RL implementation details: loss, gradient accumulation, GRPO variance
Simplifying LLM-RL: start minimal, add complexity only when needed
Zero-shot tool-integrated reasoning with RL — small models, big results
Async rollout + pipeline: 4x faster training built on OpenRLHF
Training-inference consistency for MoE models under RL
Stabilizing off-policy RL training under high staleness