HEAD_TEXT = """
Based on the 🏎️RACE benchmark, we evaluate the ability of different LLMs to generate code that is **_correct_** and **_meets the requirements of real-world development scenarios_**.
More details about how to evaluate an LLM are available in the [🏎️RACE GitHub repository](https://github.com/jszheng21/RACE). For a complete description of the RACE benchmark and the related experimental analysis, please refer to the paper: [Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models](https://arxiv.org/abs/2407.11470).
**_Latest News_** 🔥
- [24/10/09] We release the second version of the [RACE paper](https://arxiv.org/abs/2407.11470).
- [24/10/09] We add the evaluation results of 9 LLMs (including `o1-mini-2024-09-12`) to the [RACE leaderboard](https://huggingface.co/spaces/jszheng/RACE_leaderboard).
- [24/10/01] We have improved the calculation methods for the readability-related metrics and enhanced the robustness of the code post-processing techniques.
- [24/10/01] We have revised the test code in the LeetCode evaluation data to support cases with multiple correct answers.
- [24/07/24] We add the evaluation results of `claude-3.5-sonnet` and `Qwen2-72B-Instruct` to the [RACE leaderboard](https://huggingface.co/spaces/jszheng/RACE_leaderboard).
- [24/07/16] We release our RACE benchmark, leaderboard, and paper.
"""
ABOUT_TEXT = """# What is the RACE benchmark?
RACE is a multi-dimensional benchmark for code generation that focuses on **R**eadability, m**A**intainability, **C**orrectness, and **E**fficiency.
Its goal is to evaluate the ability of LLMs to generate code that is correct and meets the requirements of real-world development scenarios.
The benchmark is designed with various real-world demands across different **_demand-dependent_** dimensions, making it more applicable to practical scenarios.
# What are the specific aspects to be evaluated?
We have summarized representative influencing factors in real-world scenarios for each dimension and designed various requirements for each factor.
These requirements are incorporated into the task description to prompt the LLM to generate code that is correct and meets the specified requirements.
The specific factors are as follows (an illustrative prompt sketch appears after the list):
- **Readability**: The code should be easy to read and understand.
    - `Comment`
    - `Naming Convention`
    - `Length`
- **Maintainability**: The code should be easy to maintain and extend.
    - `MI Metric`
    - `Modularity`
- **Efficiency**: The code should be efficient in terms of time and space complexity.
    - `Time Complexity`
    - `Space Complexity`
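
For instance, a readability-oriented task might pair the original problem with a customized instruction, roughly along the lines of the sketch below (the wording and variable names are illustrative, not the benchmark's actual prompt template):

```python
# Hypothetical illustration of a demand-dependent task description; not the official RACE prompt.
problem = (
    "Write a function that returns the indices of the two numbers "
    "in nums that add up to target."
)
readability_requirement = (
    "Requirement: use descriptive snake_case names for all variables and functions, "
    "and add a brief comment before each non-trivial step."
)
prompt = problem + " " + readability_requirement
```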
# How to evaluate?
To facilitate evaluation on the RACE benchmark, we provide the evaluation data and easy-to-use evaluation scripts in our [🏎️RACE GitHub repository](https://github.com/jszheng21/RACE).
Additionally, all execution-based evaluations are run in a virtual environment to ensure evaluation security.
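
Conceptually, each execution-based check runs the generated code in an isolated process with a time limit, as in the generic sketch below (a simplified illustration, not the repository's actual evaluation code; the helper name and file layout are assumptions):

```python
import subprocess

# Generic sketch of a sandboxed, execution-based check (hypothetical helper).
def run_tests_sandboxed(test_file: str, timeout_s: int = 10) -> bool:
    # Return True if the test script exits successfully within the time limit.
    try:
        result = subprocess.run(
            ["python", test_file],   # e.g. a generated solution bundled with its unit tests
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```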
# Contact
If you have any questions, feel free to reach out to us at [[email protected]](mailto:[email protected]).
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{zheng2024race,
    title={Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models},
    author={Jiasheng Zheng and Boxi Cao and Zhengzhao Ma and Ruotong Pan and Hongyu Lin and Yaojie Lu and Xianpei Han and Le Sun},
    year={2024},
    eprint={2407.11470},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2407.11470},
}
"""
ACKNOWLEDGEMENT_TEXT = """
Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
"""
NOTES_TEXT = """
**Notes:**
- `🎯 RACE Score` denotes the final evaluation result based on the 🏎️RACE benchmark; it is the average of the scores in the four dimensions: `✅ Correctness`, `📖 Readability`, `🔨 Maintainability`, and `🚀 Efficiency` (a small worked example follows these notes).
- All fine-grained evaluation results are provided in `⬇️ Hidden Columns`. `📖 R` denotes code **R**eadability, `🔨 M` denotes code **M**aintainability, and `🚀 E` denotes code **E**fficiency. `*` denotes the code accuracy in the absence of customized instructions. More details about the abbreviations are as follows:
    - `📖 R*`: The code accuracy (baseline).
    - `📖 RN`: The proportion of code that is both functionally correct and follows the customized instructions related to `Naming Convention`.
    - `📖 RL`: The proportion of code that is both functionally correct and follows the customized instructions related to `Code Length`.
    - `📖 RC`: The proportion of code that is both functionally correct and follows the customized instructions related to `Comment`.
    - `🔨 MI*`: The code accuracy related to `Maintainability Index` (baseline).
    - `🔨 MI`: The proportion of code that is both functionally correct and follows the customized instructions related to `MI Metric`.
    - `🔨 MC*`: The code accuracy related to `Modularity` (baseline).
    - `🔨 MC`: The proportion of code that is both functionally correct and follows the customized instructions related to `Modularity`.
    - `🚀 E*`: The code accuracy (baseline).
    - `🚀 E_NI_T`: The proportion of code that is both functionally correct and follows the customized instructions related to `Time Complexity`.
    - `🚀 E_NI_S`: The proportion of code that is both functionally correct and follows the customized instructions related to `Space Complexity`.
- Regarding the types of evaluation results, `🔨 MI`, `🚀 E_NI_T`, and `🚀 E_NI_S` are scalar values ranging from 0 to 100, while the remaining metrics are percentages.
- For more explanation, check the 📝 About section.
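
As a minimal worked example of how the `🎯 RACE Score` is obtained (the function below is illustrative only; the benchmark simply averages the four dimension scores):

```python
# Illustrative only: the RACE Score as the mean of the four dimension scores (0-100 scale).
def race_score(correctness, readability, maintainability, efficiency):
    return (correctness + readability + maintainability + efficiency) / 4

print(race_score(80.0, 70.0, 65.0, 75.0))  # -> 72.5
```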
""" |