This position paper argues that current AI evaluation lacks granular diagnostic capabilities; it proposes item-level benchmark data and psychometric principles...
Level: advanced
By Han Jiang
Category: research