Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design

This research investigates how specification gaming in reinforcement learning triggers deceptive behaviors in LLMs, revealing that environment design critica...

Level: advanced

By Leon Eshuijs

Category: discussion