ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training
This research addresses critical instability in multi-turn LLM training by introducing ST-PPO, a stabilized variant of Proximal Policy Optimization that util...