This research investigates how training incentives like length penalties impact the monitorability of chain-of-thought reasoning. It reveals critical trade-o...
Level: advanced
By Matt MacDermott, Qiyao Wei, Rada Djoneva, Francis Rhys Ward
Category: research