# RACE_leaderboard / text_content.py
HEAD_TEXT = """
Based on the 🏎️RACE benchmark, we demonstrate the ability of different LLMs to generate code that is **_correct_** and **_meets the requirements of real-world development scenarios_**.
More details about how to evaluate the LLMs are available in the [🏎️RACE GitHub repository](https://github.com/jszheng21/RACE). For a complete description of the RACE benchmark and the related experimental analysis, please refer to the paper: [Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models](https://arxiv.org/abs/2407.11470).
**_Latest News_** 🔥
- [24/10/09] We released the second version of the [RACE paper](https://arxiv.org/abs/2407.11470).
- [24/10/09] We added the evaluation results of 9 LLMs (including `o1-mini-2024-09-12`) to the [RACE leaderboard](https://huggingface.co/spaces/jszheng/RACE_leaderboard).
- [24/10/01] We improved the calculation of the readability-related metrics and made the code post-processing more robust.
- [24/10/01] We revised the test code in the LeetCode evaluation data to support cases with multiple correct answers.
- [24/07/24] We added the evaluation results of `claude-3.5-sonnet` and `Qwen2-72B-Instruct` to the [RACE leaderboard](https://huggingface.co/spaces/jszheng/RACE_leaderboard).
- [24/07/16] We released the RACE benchmark, leaderboard, and paper.
"""
ABOUT_TEXT = """# What is the RACE benchmark?
RACE is a multi-dimensional benchmark for code generation that focuses on **R**eadability, m**A**intainability, **C**orrectness, and **E**fficiency.
Its goal is to evaluate an LLM's ability to generate code that is correct and meets the requirements of real-world development scenarios.
The benchmark is designed with various real-world demands across different **_demand-dependent_** dimensions, making it more applicable to practical scenarios.
# What are the specific aspects to be evaluated?
We have summarized representative influencing factors in real-world scenarios for different dimensions and designed various requirements for each factor.
These are incorporated into the task description to prompt the LLM to generate code that is correct and meets the specified requirements (see the illustrative sketch after the list below).
The specific factors are as follows:
- **Readability**: The code should be easy to read and understand.
    - `Comment`
    - `Naming Convention`
    - `Length`
- **Maintainability**: The code should be easy to maintain and extend.
    - `MI Metric`
    - `Modularity`
- **Efficiency**: The code should be efficient in terms of time and space complexity.
    - `Time Complexity`
    - `Space Complexity`
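
To make this concrete, the sketch below shows how a factor-specific requirement can be attached to a task description. The task text, the requirement wordings, and the `build_prompt` helper are purely illustrative and are not the benchmark's actual prompt templates; the real instructions are defined in the RACE evaluation data.

```python
# Illustrative only: made-up instruction wordings, not the real RACE prompts.
base_task = "Write a function that returns the n-th Fibonacci number."

requirements = {
    "Comment": "Add inline comments explaining the key steps.",
    "Naming Convention": "Use snake_case for all identifiers.",
    "Time Complexity": "The solution must run in O(n) time.",
}

def build_prompt(task: str, factor: str) -> str:
    # Append one factor-specific requirement to the base task description.
    return f"{task} Additional requirement ({factor}): {requirements[factor]}"

print(build_prompt(base_task, "Naming Convention"))
```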
# How to evaluate?
To facilitate evaluation on the RACE benchmark, we provide the evaluation data and easy-to-use evaluation scripts in our [🏎️RACE GitHub repository](https://github.com/jszheng21/RACE).
Additionally, evaluations that involve executing generated code are carried out in a virtual environment to ensure security; a simplified illustration of this idea is shown below.
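
As a rough illustration of execution-based checking (and not the actual implementation in the repository), generated code can be run in a separate process with a timeout and its output compared against the expected result; the `run_generated_code` helper below is a hypothetical sketch:

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: float = 5.0) -> str:
    # Write the candidate code to a temporary file and run it in a separate
    # Python process. The timeout guards against infinite loops. This simplified
    # stand-in does not provide full isolation on its own.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout.strip()

# Toy check: compare the program's output against the expected answer.
assert run_generated_code("print(sum(range(10)))") == "45"
```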
# Contact
If you have any questions, feel free to reach out to us at [[email protected]](mailto:[email protected]).
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{zheng2024race,
      title={Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models},
      author={Jiasheng Zheng and Boxi Cao and Zhengzhao Ma and Ruotong Pan and Hongyu Lin and Yaojie Lu and Xianpei Han and Le Sun},
      year={2024},
      eprint={2407.11470},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2407.11470},
}
"""
ACKNOWLEDGEMENT_TEXT = """
Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
"""
NOTES_TEXT = """
**Notes:**
- `💯 RACE Score` denotes the final evaluation result on the 🏎️RACE benchmark, which is the average of the scores in the four dimensions: `✅ Correctness`, `📖 Readability`, `🔨 Maintainability`, and `🚀 Efficiency` (a minimal illustration appears at the end of these notes).
- All fine-grained evaluation results are provided in `⏬ Hidden Columns`. `📖 R` denotes code **R**eadability, `🔨 M` denotes code **M**aintainability, and `🚀 E` denotes code **E**fficiency. `*` denotes the code accuracy in the absence of customized instructions. More details about the abbreviations are as follows:
    - `📖 R*`: The code accuracy (baseline).
    - `📖 RN`: The proportion of code that is both functionally correct and follows customized instructions related to `Naming Convention`.
    - `📖 RL`: The proportion of code that is both functionally correct and follows customized instructions related to `Code Length`.
    - `📖 RC`: The proportion of code that is both functionally correct and follows customized instructions related to `Comment`.
    - `🔨 MI*`: The code accuracy related to `Maintainability Index` (baseline).
    - `🔨 MI`: The proportion of code that is both functionally correct and follows customized instructions related to `MI Metric`.
    - `🔨 MC*`: The code accuracy related to `Modularity` (baseline).
    - `🔨 MC`: The proportion of code that is both functionally correct and follows customized instructions related to `Modularity`.
    - `🚀 E*`: The code accuracy (baseline).
    - `🚀 E_NI_T`: The proportion of code that is both functionally correct and follows customized instructions related to `Time Complexity`.
    - `🚀 E_NI_S`: The proportion of code that is both functionally correct and follows customized instructions related to `Space Complexity`.
- Regarding the types of evaluation results, `🔨 MI`, `🚀 E_NI_T`, and `🚀 E_NI_S` are scalar scores ranging from 0 to 100, while the remaining metrics are percentages.
- For more details, check the 📝 About section.
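
As a minimal illustration of how the overall score is aggregated, assuming the four dimension scores have already been computed (the numbers below are made up):

```python
# Illustrative only: the RACE Score is the plain average of the four dimension scores.
dimension_scores = {
    "Correctness": 80.0,
    "Readability": 70.0,
    "Maintainability": 65.0,
    "Efficiency": 60.0,
}
race_score = sum(dimension_scores.values()) / len(dimension_scores)
print(f"RACE Score = {race_score:.1f}")  # RACE Score = 68.8
```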
"""