Efficient RL Training for LLMs with Experience Replay

This research challenges the necessity of fresh on-policy data for LLM post-training by optimizing experience replay buffers to balance variance, diversity, ...

Level: advanced

By Charles Arnal

Category: research