Hello, I'm
RL-Infra & RL Algorithms · Agentic-RL
MiroMind
I currently work at MiroMind on RL-Infra and RL algorithm development, focusing on Agentic-RL. I contributed the fully-async scheduling architecture to OpenRLHF, an open-source, Ray-based, high-performance RLHF framework widely adopted by the community. My research spans LLM reasoning, MCTS, scaling laws for agentic reasoning, and efficient reasoning compression.
RL-Infra & RL Algorithms (Agentic-RL)
Math Reasoning & Agentic-RL
AIGC Safety
Content Safety, Public Opinion Analysis, Personal Privacy, LLM Safety
Department of Electronic Engineering
Bachelor's Degree
Top-venue publications accepted in 2025–2026
Discovering scaling laws for agentic reinforcement learning, in which LLMs spontaneously learn to execute code for mathematical reasoning.
Rethinking how to scale agent interactions for long-horizon research tasks.
A comprehensive gym environment for training and evaluating generalist large language models.
A novel self-play framework that advances agent capabilities through search without external supervision.
A unified dynamic preference optimization framework that adaptively reweights samples by jointly considering data quality and the model's evolving performance during training.
A comprehensive survey of reasoning LLMs, tracing the evolution from fast intuitive (System 1) to slow deliberate (System 2) reasoning paradigms.
A Ray-based open-source framework enabling scalable RLHF training, widely adopted by the research community.
Showing that LLMs can bootstrap their inductive reasoning capability through deductive self-teaching.
An efficient method to compress long chain-of-thought reasoning while preserving accuracy.
Full list on Google Scholar
Technical writing on LLM reasoning, MCTS, and reinforcement learning
Research interests: Probabilistic Graphical Models, Reinforcement Learning, NLP
System 1 vs System 2 thinking paradigm with MCTS for LLM reasoning
Insights on reproducing OpenAI o1 with long-CoT distillation and DPO
Reward shaping, domain randomization, and what zero-RL really optimizes
Deep dive into RL implementation details: loss, gradient accumulation, GRPO variance
Simplifying LLM-RL: start minimal, add complexity only when needed
Zero-shot tool-integrated reasoning with RL — small models, big results
Async rollout + pipeline: 4x faster training built on OpenRLHF
Training-inference consistency for MoE models under RL
Stabilizing off-policy RL training under high staleness