GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models
This research explores using sparse autoencoders to decode social biases within GPT-style models trained on literary data, offering a scalable path to enhanc...