Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models

Explore Step-Aware Policy Optimization (SAPO), a novel reinforcement learning approach that aligns diffusion model denoising with hierarchical reasoning stru...

Level: advanced

Category: research