WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

WebDevJudge introduces a rigorous benchmark for assessing how large language models (LLMs) and multimodal LLMs (MLLMs) critique web development quality, revealing critical ...

Level: advanced

By Chunyang Li and 7 other authors

Category: research