This research investigates how on-policy data in reinforcement learning mitigates catastrophic forgetting during LLM post-training, demonstrating superior re...
Level: advanced
By Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen
Category: research