|
Yixiu Mao | 毛逸休
I'm a fifth-year Ph.D. student in the Department of Automation at Tsinghua University, advised by Prof. Xiangyang Ji.
My research focuses on Reinforcement Learning and Large Reasoning Models.
I work with the THU-IDM team, where we develop efficient algorithms for decision-making.
Prior to my doctoral studies, I received my Bachelor's degree in Physics from Zhiyuan Honor College at Shanghai Jiao Tong University.
Email /
Scholar /
GitHub
|
|
News
- [2026-02] I am currently exploring job opportunities. If my experience and research interests align with your needs, I would be delighted to connect.
- [2021-09] Started my Ph.D. journey at Tsinghua University.
|
Research
I'm interested in Reinforcement Learning and Large Language Models. My research focuses on efficient and intelligent decision-making with minimal environment interactions.
|
Selected Publications
* denotes co-first authors
|
|
|
Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models
Yixiu Mao,
Yun Qu,
Qi Wang,
Heming Zou,
Xiangyang Ji
ICLR, 2026
Predicts and selects informative prompts online, prior to rollout, by inferring their learning dynamics, which accelerates RL finetuning of large reasoning models.
|
|
|
Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning
Yixiu Mao,
Yun Qu,
Qi Wang,
Xiangyang Ji
NeurIPS, 2025 Spotlight
paper
/
code
A new constrained optimization paradigm for offline RL.
|
|
|
Doubly Mild Generalization for Offline Reinforcement Learning
Yixiu Mao,
Qi Wang,
Yun Qu,
Yuhang Jiang,
Xiangyang Ji
NeurIPS, 2024
paper
/
code
Appropriately exploits generalization in offline RL.
|
|
|
Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression
Yixiu Mao,
Qi Wang,
Chen Chen,
Yun Qu,
Xiangyang Ji
NeurIPS, 2024
paper
/
code
A simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL.
|
|
|
Supported Value Regularization for Offline Reinforcement Learning
Yixiu Mao,
Hongchang Zhang,
Chen Chen,
Yi Xu,
Xiangyang Ji
NeurIPS, 2023
paper
/
code
Theoretically grounded value regularization for offline RL.
|
|
|
Supported Trust Region Optimization for Offline Reinforcement Learning
Yixiu Mao,
Hongchang Zhang,
Chen Chen,
Yi Xu,
Xiangyang Ji
ICML, 2023
paper
/
code
Theoretically grounded policy optimization for offline RL.
|
|
|
In-Sample Actor Critic for Offline Reinforcement Learning
Hongchang Zhang*,
Yixiu Mao*,
Boyuan Wang,
Shuncheng He,
Yi Xu,
Xiangyang Ji
ICLR, 2023
paper
In-sample learning for offline RL, avoiding extrapolation error.
|
|
|
A Hypergradient Approach to Robust Regression without Correspondence
Yujia Xie*,
Yixiu Mao*,
Simiao Zuo,
Hongteng Xu,
Xiaojing Ye,
Tuo Zhao,
Hongyuan Zha
ICLR, 2021
paper
/
code
We consider a regression problem, where the correspondence between input and output data is not available.
|
|
|
Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments
Yun Qu*,
Qi Wang*,
Yixiu Mao*,
Yiqin Lv,
Xiangyang Ji
ICML, 2025
project page
/
paper
/
code
We propose an easy-to-implement method, referred to as Posterior and Diversity Synergized Task Sampling (PDTS), to accommodate fast and robust sequential decision-making.
|
|
|
Robust Fast Adaptation from Adversarially Explicit Task Distribution Generation
Cheems Wang*,
Yiqin Lv*,
Yixiu Mao*,
Yun Qu,
Yi Xu,
Xiangyang Ji
KDD, 2025
project page
/
paper
/
code
We explicitly model task distributions over task identifiers with a generative model and propose robustifying fast adaptation through adversarial training.
|
|
|
Model Predictive Task Sampling for Efficient and Robust Adaptation
Qi Wang*,
Zehao Xiao*,
Yixiu Mao*,
Yun Qu*,
Jiayi Shen,
Yiqin Lv,
Xiangyang Ji
arXiv, 2025
paper
/
code
We introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk landscape, providing a theoretical foundation for robust active task sampling.
|
|
|
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
Yun Qu,
Qi Wang,
Yixiu Mao,
Vincent Tao Hu,
Björn Ommer,
Xiangyang Ji
KDD, 2026
paper
/
code
This work introduces Model Predictive Prompt Selection, a Bayesian risk-predictive framework that estimates prompt difficulty online without requiring costly LLM interactions.
|
|
|
Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning
Yun Qu*,
Yuhang Jiang*,
Boyuan Wang,
Yixiu Mao,
Qi Wang,
Chang Liu,
Xiangyang Ji
AAAI, 2025
paper
/
code
We introduce LaRe, a novel LLM-empowered symbolic-based decision-making framework, to improve credit assignment in episodic reinforcement learning.
|
|
|
Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Zhiyu Mou,
Yiqin Lv,
Miao Xu,
Qi Wang,
Yixiu Mao,
Qichen Ye,
Chao Li,
Rongquan Bai,
Chuan Yu,
Jian Xu,
Bo Zheng
ICLR, 2026 Oral
paper
|
|