Code Evaluation
A collection of papers on evaluating code generated by language models.
Paper • 2311.07989 • Published • 21
Note: Great overview and a lot of additional references! Frequently updated list: https://github.com/codefuse-ai/Awesome-Code-LLM
Evaluating Large Language Models Trained on Code
Paper • 2107.03374 • Published • 6
Note: Introduces HumanEval and pass@k.
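As a quick reference, the paper's unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples per problem of which c pass the tests, can be computed in a numerically stable way; a minimal numpy sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), written as a product
    so large binomial coefficients never have to be formed explicitly."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 of which pass the unit tests.
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.88
```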
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • 2310.06770 • Published • 4
Note: Currently the most promising benchmark, and it has already been used for marketing (Devin). It models the actual work of a software engineer by using GitHub issues as inputs and the whole repository as a resource. The benchmark tests systems, not just models, so agent-like managers and retrieval components are fair game. Rapidly advancing leaderboard: https://www.swebench.com/
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Paper • 2102.04664 • Published • 1
Note: An older work consisting of several easy tasks for encoder and decoder models, for example line completion or a min/max cloze test.
Out of the BLEU: how should we assess quality of the Code Generation models?
Paper • 2208.03133 • Published • 2
Note: Human judgement doesn't agree with static metrics (BLEU, ChrF, RUBY, ...).
ReCode: Robustness Evaluation of Code Generation Models
Paper • 2212.10264 • Published • 1
Note: Applies perturbations to docstrings/prompts to test robustness.
Space • Big Code Models Leaderboard • Running • 982 📈
Textbooks Are All You Need
Paper • 2306.11644 • Published • 142
Note: The phi-1 model; proposes novel evaluation problems that combine two tasks to make data contamination less likely.
Textbooks Are All You Need II: phi-1.5 technical report
Paper • 2309.05463 • Published • 87
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Paper • 2403.07974 • Published • 1
Note: Annotates problems by month to spot potential contamination.
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
Paper • 2310.11248 • Published • 3
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
Paper • 2302.05527 • Published • 1
A Static Evaluation of Code Completion by Large Language Models
Paper • 2306.03203 • Published • 3
Note: The model context is the whole file up until the function header, with the source file as ground truth. Generations are first parsed into an AST, then checked with a linter. Errors in the context cause errors in the generation; undefined-name errors are the most common, and EOF errors come from the generation length limit.
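A minimal, stdlib-only sketch of that kind of static check; the paper's pipeline uses a proper parser plus a production linter, so the crude undefined-name pass below is only an approximation of the idea:

```python
import ast
import builtins

def static_check(source: str) -> list[str]:
    """Report syntax errors (truncated generations typically show up here)
    and a crude undefined-name check, roughly AST-then-linter style."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg} (line {err.lineno})"]

    defined = set(dir(builtins))
    used: list[tuple[str, int]] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.update(a.arg for a in node.args.args)
        elif isinstance(node, ast.alias):  # imports
            defined.add((node.asname or node.name).split(".")[0])
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.append((node.id, node.lineno))
    return [f"possibly undefined name: {name} (line {line})"
            for name, line in used if name not in defined]

print(static_check("def f(x):\n    return x + y\n"))
# -> ['possibly undefined name: y (line 2)']
```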
Large Language Models Are State-of-the-Art Evaluators of Code Generation
Paper • 2304.14317 • Published • 2
Measuring Coding Challenge Competence With APPS
Paper • 2105.09938 • Published • 1
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation
Paper • 2208.08227 • Published • 1
Program Synthesis with Large Language Models
Paper • 2108.07732 • Published • 4
Note: Introduces Mostly Basic Programming Problems (MBPP).
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
Paper • 2303.12570 • Published
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Paper • 2306.03091 • Published • 1
TACO: Topics in Algorithmic COde generation dataset
Paper • 2312.14852 • Published • 4
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11
Space • Can Ai Code Results • Running • 415 🏆
Boldly Going Where No Benchmark Has Gone Before: Exposing Bias and Shortcomings in Code Generation Evaluation
Paper • 2401.03855 • Published
Note: Has since been renamed "Python Saga". Figure 6 is a great illustration of benchmark saturation.
NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness
Paper • 2401.15963 • Published
Note: "Non-functional" requirements; focuses on instruct models. Includes classification tasks and non-functional requirements such as efficiency, security, and maintainability. Still relies on gold labels via DiffBLEU. Filters out examples larger than 3k tokens using the StarCoder tokenizer. GPT-4 seems to be really good on these tasks, though that might involve a lot of prompt engineering.
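A minimal sketch of that kind of token-length filter, assuming the Hugging Face transformers tokenizer for the bigcode/starcoder checkpoint (checkpoint name and exact cutoff handling are assumptions; the paper's setup may differ):

```python
from transformers import AutoTokenizer

# Assumed checkpoint; any tokenizer with the same interface works.
tok = AutoTokenizer.from_pretrained("bigcode/starcoder")

MAX_TOKENS = 3_000  # examples "larger than 3k tokens" are filtered out

def fits_budget(code: str) -> bool:
    """True if the example tokenizes to at most MAX_TOKENS tokens."""
    return len(tok(code)["input_ids"]) <= MAX_TOKENS

examples = [
    "def add(a, b):\n    return a + b\n",
    "x = 1\n" * 5_000,  # very long file, gets dropped
]
kept = [ex for ex in examples if fits_budget(ex)]
print(len(kept))  # -> 1
```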
DevEval: Evaluating Code Generation in Practical Software Projects
Paper • 2401.06401 • Published
Note: This paper has been withdrawn!
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Paper • 2305.01210 • Published • 4
Note: HumanEval+ adds additional test cases to HumanEval. Has a great figure ranking HumanEval problems by pass rate, showing that some of them are much easier than others.
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code
Paper • 2306.04556 • Published
CreativEval: Evaluating Creativity of LLM-Based Hardware Code Generation
Paper • 2404.08806 • Published
Note: Verilog generation. "Creativity" here refers to "the capacity to think innovatively: the ability to formulate new solutions or connections that are effective and unconventional", citing https://doi.org/10.1080/10400419.2012.650092. "Fluency" is measured as how many of the pass@k variants are unique (but this will be skewed towards larger models, which sample more widely, right?).
Benchmarking Language Model Creativity: A Case Study on Code Generation
Paper • 2407.09007 • Published • 3
Note: NeoCoder; uses denial prompting to elicit novel approaches, even outside of "historical human solutions". (cont)
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
Paper • 2404.03543 • Published • 15
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
Paper • 2312.12450 • Published • 1
Space • BigCodeBench Leaderboard • Running • 136 🥇
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Paper • 2406.15877 • Published • 45
Execution-Based Evaluation for Open-Domain Code Generation
Paper • 2212.10481 • Published • 1
On Leakage of Code Generation Evaluation Datasets
Paper • 2407.07565 • Published • 5
Note: Contamination directly: every HumanEval problem appears at least 43 times on GitHub. Contamination indirectly: high similarity with synthetic rephrases used for instruction tuning. On overfitting to benchmarks: HumanEval and MBPP should be considered "dev sets". Also adds LBPP (Less Basic ...).
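For the "direct" kind of check described above, one common heuristic is verbatim n-gram overlap between benchmark prompts and training documents; a minimal sketch (not the paper's exact methodology, and the 13-gram window is just a conventional choice):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams; 13-grams are a common de-dup window."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(benchmark_prompt: str, training_doc: str, n: int = 13) -> bool:
    """Flag a benchmark prompt if any of its n-grams appears verbatim
    in a training document."""
    return bool(ngrams(benchmark_prompt, n) & ngrams(training_doc, n))

prompt = ("def has_close_elements(numbers, threshold): Check if in given list "
          "of numbers, any two numbers are closer to each other than the given threshold.")
doc = ("... Check if in given list of numbers, any two numbers are closer "
       "to each other than the given threshold ...")
print(is_contaminated(prompt, doc))  # -> True
```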
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
Paper • 2403.04811 • Published
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
Paper • 2308.01861 • Published • 1
Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code
Paper • 2308.03109 • Published • 1
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models
Paper • 2404.07940 • Published
Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
Paper • 2403.19114 • Published
A Systematic Evaluation of Large Language Models of Code
Paper • 2202.13169 • Published • 1
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
Paper • 2407.06153 • Published
To Code, or Not To Code? Exploring Impact of Code in Pre-training
Paper • 2408.10914 • Published • 40
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
Paper • 2211.11501 • Published
mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
Paper • 2410.15037 • Published