Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

This research introduces Process-Aware Policy Optimization (PAPO), a method that stabilizes training by decoupling outcome and process signals to enhan...

Level: advanced

By Zelin Tan

Category: research