Current LLMs seem to rarely detect CoT tampering — AI Alignment Forum

Recent research reveals that current Large Language Models often fail to detect tampering in their own reasoning steps. This vulnerability poses significant ...
