RealDataAgentBench

Most LLMs get the right answer. RDAB checks if they did it the right way.
