Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
This research introduces 'blind refusal,' a critical failure mode in which safety-trained models reject requests to break unjust rules, highlighting a decoupling...