R²PO introduces a novel architectural approach to decouple training trajectories from inference responses in LLMs, enhancing reasoning accuracy while maintai...
Level: advanced
By Jingchu Wang, Bingbing Xu, Yige Yuan, Bin Xie, Xiaoqian Sun, Huawei Shen
Category: research