Fantastic Bugs and Where to Find Them in AI Benchmarks

This research introduces a statistical framework to detect invalid questions in AI benchmarks by modeling performance as a latent construct, significantly re...

Level: advanced

By Sang Truong and 10 other authors

Category: research