Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

This survey explores the systemic vulnerability of reward hacking in large language models, introducing the Proxy Compression Hypothesis to explain emergent ...

Level: advanced

By Xiaohua Wang

Category: discussion