This research investigates gradient imbalance in multi-task RL post-training, revealing how task mixing causes unstable policy updates and suboptimal converg...
Level: advanced
By Runzhe Wu and 9 other authors
Category: research