Does Self-Evaluation Enable Wireheading in Language Models?

This research investigates the risk of wireheading in language models whose self-evaluation drives the reward signal, revealing how gradient-based feed...

Level: advanced

By David Demitri Africa, Hans Ethan Ting

Category: research