RealDataAgentBench

Most LLMs get the right answer. RDAB checks if they did it the right way.

Rank | Model | RDAB Score | Coverage | Avg Cost / Task | Provider
Task | Title | Difficulty | Category | Model | Runs | Correctness | Code Quality | Efficiency | Stat Validity | RDAB Score | Tokens | Avg Cost ($) | Score / $