This PhD thesis investigates critical safety failures in aligned AI agents, introducing ACDC for circuit discovery and Latent Adversarial Training to mitigat...
Level: expert
By Aengus Lynch
Category: discussion