Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

Explore how mechanistic interpretability and Layer-Patching reveal the Knobe effect in finetuned LLMs, offering a pathway to eliminate social biases without ...

Level: advanced

By Unknown

Category: discussion