How Reliable is Language Model Micro-Benchmarking?

This research investigates the reliability of micro-benchmarking for ranking language models, revealing that small sample sizes often fail to provide accurat...

Level: advanced

By Unknown

Category: research