ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Explore ClawsBench, a novel benchmark evaluating the capability and safety of LLM agents within high-fidelity simulated productivity environments like Gmail ...

Level: advanced

By Xiangyi Li

Category: research