This research investigates the critical risk of wireheading in language models where self-evaluation drives reward signals, revealing how gradient-based feed...
Level: advanced
By David Demitri Africa, Hans Ethan Ting
Category: research