Hello, I'm

Haotian Xu 许皓天

RL-Infra & RL Algorithms · Agentic-RL

MiroMind

1,500+ Citations
14 h-index
8+ Top Venues

About Me

I currently work at MiroMind on RL infrastructure (RL-Infra) and RL algorithm development, with a focus on Agentic-RL. I contributed the fully asynchronous scheduling architecture to OpenRLHF, an open-source, Ray-based, high-performance RLHF framework widely adopted by the community. My research spans LLM reasoning, MCTS, scaling laws for agentic reasoning, and efficient reasoning compression.
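The idea behind fully asynchronous scheduling is that rollout generation and training are decoupled, so neither side blocks waiting for the other. A minimal toy sketch of that producer/consumer pattern (hypothetical illustration only, not OpenRLHF's actual code or API):

```python
# Toy sketch of fully-async RL scheduling: rollout workers keep generating
# while the trainer consumes samples as they arrive.
# Hypothetical example -- not OpenRLHF's actual implementation.
import asyncio
import random


async def rollout_worker(wid: int, queue: asyncio.Queue) -> None:
    """Continuously generate samples and enqueue them without waiting on the trainer."""
    for step in range(3):
        await asyncio.sleep(random.uniform(0.01, 0.03))  # simulate generation latency
        await queue.put((wid, step, random.random()))    # (worker, step, mock reward)


async def trainer(queue: asyncio.Queue, total: int) -> list:
    """Consume samples as soon as any worker produces one; never wait for a full sync."""
    consumed = []
    for _ in range(total):
        sample = await queue.get()
        consumed.append(sample)  # in real RLHF this would trigger a gradient step
    return consumed


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    workers = [asyncio.create_task(rollout_worker(w, queue)) for w in range(4)]
    samples = await trainer(queue, total=4 * 3)
    await asyncio.gather(*workers)
    return samples


samples = asyncio.run(main())
print(len(samples))  # 12 samples, interleaved across workers
```

In a production system the queue and workers would be distributed (e.g. Ray actors) rather than in-process coroutines, but the scheduling principle is the same.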

Research Interests

RL-Infra · Agentic-RL · LLM Reasoning · MCTS · RLHF · Scaling Laws · NLP · LLM Safety

Work Experience

2025.11 – Present

MiroMind

RL-Infra & RL Algorithms (Agentic-RL)

2024 – 2025.8

Xiaohongshu (小红书) — HiLab

Math-Reasoning & Agentic-RL

2023

Douyin (抖音 / ByteDance)

AIGC Safety

2018 – 2023

Alibaba (阿里巴巴)

Content Safety, Public Opinion Analysis, Personal Privacy, LLM Safety

Education

2013 – 2016

Tsinghua University (清华大学)

Department of Electronic Engineering

2009 – 2013

UESTC (电子科技大学)

Bachelor's Degree

2025 Highlights

Top-venue publications accepted in 2025–2026

NeurIPS 2025
45 citations

Agentic RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving

X Mai, H Xu, W Wang, Y Zhang, W Zhang

Discovering scaling laws for agentic reinforcement learning where LLMs spontaneously learn to execute code for mathematical reasoning.

ICLR 2026
17 citations

IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling

G Chen, Z Qiao, X Chen, D Yu, H Xu, WX Zhao, R Song, W Yin, H Yin, et al.

Rethinking how to scale agent interactions for long-horizon research tasks.

ICLR 2026
11 citations

GEM: A Gym for Generalist LLMs

Z Liu, A Sims, K Duan, C Chen, S Yu, X Zhou, H Xu, S Xiong, B Liu, C Tan, et al.

A comprehensive gym environment for training and evaluating generalist large language models.

ICLR 2026
9 citations

Search Self-Play: Pushing the Frontier of Agent Capability without Supervision

H Lu, Y Wen, P Cheng, R Ding, J Guo, H Xu, C Wang, H Chen, X Jiang, et al.

A novel self-play framework that advances agent capabilities through search without external supervision.

ICLR 2026

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

S Peng, W Wang, Z Tian, S Yang, X Wu, H Xu, C Zhang, T Isobe, B Hu, M Zhang

A unified dynamic preference optimization framework that adaptively reweights samples by jointly considering data quality and model performance evolution during training.

IEEE TPAMI 2025
386 citations

From System 1 to System 2: A Survey of Reasoning Large Language Models

D Zhang, ZZ Li, ML Zhang, J Zhang, Z Liu, Y Yao, H Xu, J Zheng, X Chen, et al.

A comprehensive survey of reasoning LLMs, tracing the evolution from fast intuitive (System 1) to slow deliberate (System 2) reasoning paradigms.

EMNLP 2025
304 citations

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

J Hu, X Wu, W Shen, JK Liu, W Wang, S Jiang, H Wang, H Chen, B Chen, H Xu, et al.

A Ray-based open-source framework enabling scalable RLHF training, widely adopted by the research community.

ACL 2024
14 citations

ITD: Large Language Models Can Teach Themselves Induction Through Deduction

W Sun, H Xu, X Yu, P Chen, S He, J Zhao, K Liu

Showing that LLMs can bootstrap inductive reasoning capability through deductive self-teaching.

ACL 2026
7 citations

TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

ZZ Li, X Liang, Z Tang, L Ji, P Wang, H Xu, H Huang, W Deng, Y Gong, et al.

An efficient method to compress long chain-of-thought reasoning while preserving accuracy.

Selected Publications

Full list on Google Scholar

2025

From System 1 to System 2: A Survey of Reasoning Large Language Models

D Zhang, ZZ Li, ML Zhang, J Zhang, Z Liu, Y Yao, H Xu, et al.

IEEE Transactions on Pattern Analysis and Machine Intelligence 386 citations
2025

REINFORCE++: Stabilizing Critic-free Policy Optimization with Global Advantage Normalization

J Hu, JK Liu, H Xu, W Shen

arXiv preprint 351 citations
2025

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

J Hu, X Wu, W Shen, JK Liu, W Wang, S Jiang, H Wang, H Chen, B Chen, H Xu, et al.

EMNLP 2025 304 citations
2023

CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility

G Xu, J Liu, M Yan, H Xu, J Si, Z Zhou, P Yi, X Gao, J Sang, R Zhang, et al.

arXiv preprint 104 citations
2025

RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?

H Xu, X Wu, W Wang, Z Li, D Zheng, B Chen, Y Hu, S Kang, J Ji, Y Zhang, et al.

arXiv preprint 63 citations
2024

Interpretable Contrastive Monte Carlo Tree Search Reasoning

Z Gao, B Niu, X He, H Xu, H Liu, A Liu, X Hu, L Wen

arXiv preprint 59 citations
2025

Agentic RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving

X Mai, H Xu, W Wang, Y Zhang, W Zhang

NeurIPS 2025 45 citations
2025

Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models

W Sun, X Yu, Z Huang, H Xu, S He, J Zhao, K Liu

CCL 2025 27 citations
2023

No Train Still Gain. Unleash Mathematical Reasoning of LLMs with Monte Carlo Tree Search Guided by Energy Function

H Xu

arXiv preprint 26 citations
2022

Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works

J Si, X Peng, C Li, H Xu, J Li

ICASSP 2022 22 citations
2026

IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling

G Chen, Z Qiao, X Chen, D Yu, H Xu, WX Zhao, R Song, W Yin, H Yin, et al.

ICLR 2026 17 citations
2024

ITD: Large Language Models Can Teach Themselves Induction Through Deduction

W Sun, H Xu, X Yu, P Chen, S He, J Zhao, K Liu

ACL 2024 14 citations
2026

GEM: A Gym for Generalist LLMs

Z Liu, A Sims, K Duan, C Chen, S Yu, X Zhou, H Xu, S Xiong, B Liu, C Tan, et al.

ICLR 2026 11 citations
2026

Search Self-Play: Pushing the Frontier of Agent Capability without Supervision

H Lu, Y Wen, P Cheng, R Ding, J Guo, H Xu, C Wang, H Chen, X Jiang, et al.

ICLR 2026 9 citations
2026

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

S Peng, W Wang, Z Tian, S Yang, X Wu, H Xu, C Zhang, T Isobe, B Hu, M Zhang

ICLR 2026
2025

Probabilistic Uncertain Reward Model

W Sun, X Cheng, X Yu, H Xu, Z Yang, S He, J Zhao, K Liu

arXiv preprint 7 citations
2026

TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

ZZ Li, X Liang, Z Tang, L Ji, P Wang, H Xu, et al.

ACL 2026 7 citations
2020

End-to-end Latent-variable Task-oriented Dialogue System with Exact Log-likelihood Optimization

H Xu, H Peng, H Xie, E Cambria, L Zhou, W Zheng

World Wide Web 23(3) 45 citations

Zhihu Contributions

Technical writing on LLM reasoning, MCTS, and reinforcement learning

haotian

Research interests: Probabilistic Graphical Models, Reinforcement Learning, NLP

485+ Upvotes 8+ Answers
Visit my Zhihu Profile

Get in Touch