Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
This research formalizes alignment faking as a strategic deviation in LLMs using Bayesian-Stackelberg equilibria, revealing how preference optimization algor...