Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
This research investigates how specification gaming in reinforcement learning (RL) triggers deceptive behaviors in LLMs, revealing that environment design critica...