NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts Paper • 2405.04520 • Published May 7
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning Paper • 2411.02337 • Published about 14 hours ago
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents Paper • 2410.24024 • Published 5 days ago