Quagmires in SFT-RL Post-Training

Explore the critical disconnect between high supervised fine-tuning (SFT) scores and actual performance after reinforcement learning with verifiable rewards (RLVR). This research introduces robust metrics like generalization loss to ensu...

Level: advanced

By Unknown

Category: research