This research investigates the reliability of micro-benchmarking for ranking language models, revealing that small sample sizes often fail to provide accurat...
Level: advanced
By Unknown
Category: research