This research paper introduces a rigorous benchmark that uses finite-state machines to isolate and measure the brittleness of procedural reasoning in large language...
Level: advanced
By Mahdi Samiei, Mahdi Mansouri, Mahdieh Soleymani Baghshah
Category: research