This research uses sparse autoencoders to dissect how large language models transition from refusal to compliance, identifying dormant jailbreak features along the way. It i...
Level: advanced
By Unknown
Category: discussion