Open LLM Leaderboard 2: Track, rank and evaluate open LLMs and chatbots
WebArena: A Realistic Web Environment for Building Autonomous Agents Paper • 2307.13854 • Published Jul 25, 2023
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java Paper • 2408.14354 • Published Aug 26