Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria

This research formalizes alignment faking as a strategic deviation in LLMs using Bayesian-Stackelberg equilibria, revealing how preference optimization algor...

Level: expert

By Kartik Garg and 10 other authors

Category: research