This research introduces a statistical framework to detect invalid questions in AI benchmarks by modeling performance as a latent construct, significantly re...
Level: advanced
By Sang Truong and 10 other authors
Category: research