-
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
Paper • 2405.07990 • Published • 16 -
Large Language Models as Planning Domain Generators
Paper • 2405.06650 • Published • 9 -
AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
Paper • 2404.12753 • Published • 41 -
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper • 2404.07972 • Published • 46
Collections
Discover the best community collections!
Collections including paper arxiv:2401.03065
-
A Survey on Language Models for Code
Paper • 2311.07989 • Published • 21 -
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • 2310.06770 • Published • 4 -
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11 -
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
Paper • 2402.14261 • Published • 10
-
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11 -
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Paper • 2401.14196 • Published • 47 -
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation
Paper • 2312.14187 • Published • 49 -
On the Effectiveness of Large Language Models in Domain-Specific Code Generation
Paper • 2312.01639 • Published • 1
-
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
Paper • 2401.03506 • Published • 13 -
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 19 -
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Paper • 2402.14083 • Published • 47
-
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11 -
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Paper • 2305.01210 • Published • 4 -
AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models
Paper • 2309.06495 • Published • 1 -
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Paper • 2311.16502 • Published • 35
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 181 -
Learning Vision from Models Rivals Learning Vision from Data
Paper • 2312.17742 • Published • 15 -
PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
Paper • 2312.17276 • Published • 15 -
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Paper • 2401.02669 • Published • 14
-
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Paper • 2311.12022 • Published • 25 -
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 183 -
gorilla-llm/APIBench
Updated • 115 • 63 -
Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
Paper • 2312.04724 • Published • 20
-
Creative Robot Tool Use with Large Language Models
Paper • 2310.13065 • Published • 8 -
CodeCoT and Beyond: Learning to Program and Test like a Developer
Paper • 2308.08784 • Published • 5 -
Lemur: Harmonizing Natural Language and Code for Language Agents
Paper • 2310.06830 • Published • 31 -
CodePlan: Repository-level Coding using LLMs and Planning
Paper • 2309.12499 • Published • 73
-
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
Paper • 2310.04484 • Published • 5 -
Diversity of Thought Improves Reasoning Abilities of Large Language Models
Paper • 2310.07088 • Published • 5 -
Adapting Large Language Models via Reading Comprehension
Paper • 2309.09530 • Published • 77 -
Democratizing Reasoning Ability: Tailored Learning from Large Language Model
Paper • 2310.13332 • Published • 14
-
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Paper • 2310.15123 • Published • 7 -
ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
Paper • 2310.13227 • Published • 12 -
LASER: LLM Agent with State-Space Exploration for Web Navigation
Paper • 2309.08172 • Published • 11 -
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
Paper • 2310.04406 • Published • 8