Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
This research introduces RWE-bench, a benchmark that evaluates LLM agents on generating real-world evidence from medical databases such as MIMIC-IV, revealing signi...