Imbalanced Gradients in RL Post-Training of Multi-Task LLMs

This research investigates gradient imbalance in multi-task RL post-training, revealing how task mixing causes unstable policy updates and suboptimal converg...

Level: advanced

By Runzhe Wu and 9 other authors

Category: research