Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

Explore Online Supervised Fine-Tuning (OSFT), a novel reward-free protocol that leverages latent pretraining preferences to enhance LLM reasoning. This resea...

Level: advanced

By Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li

Category: research