ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training

This research addresses critical instability in multi-turn LLM training by introducing ST-PPO, a stabilized variant of Proximal Policy Optimization that util...

Level: advanced

By Chenliang Li and 8 other authors

Category: research