SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Explore the SAFER framework, which leverages sparse autoencoders to dissect reward models in RLHF, offering actionable strategies for safety auditing and mod...

Level: advanced

By Unknown

Category: discussion