R²PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

R²PO introduces a novel architectural approach to decouple training trajectories from inference responses in LLMs, enhancing reasoning accuracy while maintai...

Level: advanced

By Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen

Category: research