SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models
This research introduces SPAN, a benchmark revealing significant temporal reasoning gaps in current LLMs across diverse calendars, and proposes a tool-augmen...