Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety

This research introduces a novel dataset and gradient-based algorithm for fine-grained, real-time safety monitoring of AI chain-of-thought reasoning using ac...

Level: advanced

By Antonio-Gabriel Chacón Menke, Phan Xuan Tan, Eiji Kamioka

Category: discussion