
HEAD_TEXT = """
This is the official leaderboard for the 🏅StructEval benchmark. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs.

Please refer to the 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper](https://arxiv.org/abs/2408.03281) for experimental analysis.

🚀 **_Latest News_**

* [2024.8.6] We released the first version of the StructEval leaderboard, which includes 22 open-source language models; more datasets and models are coming soon🔥🔥🔥.

* [2024.7.31] We regenerated the StructEval benchmark based on the latest [Wikipedia](https://www.wikipedia.org/) pages (20240601) using the [GPT-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) model, which helps minimize the impact of data contamination🔥🔥🔥.

"""


ABOUT_TEXT = """# What is StructEval?

Evaluation is the baton for the development of large language models.

Current evaluations typically employ *a single-item assessment paradigm* for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions.

To this end, we propose a novel evaluation framework referred to as ***StructEval***.

Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs.
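
For illustration only, a structured item group grown from a single seed objective might look like the sketch below; the schema and field names are assumptions made for this example, not the released data format.

```python
# Purely illustrative sketch: the schema and field names are assumptions made
# for this example, not the released StructEval data format.
structured_item_group = {
    # the atomic test objective used as the seed
    "seed_objective": "Water boils at 100 °C at standard atmospheric pressure.",
    # deepening: questions that probe the same objective at several cognitive levels
    "cognitive_levels": {
        "remember": "At what temperature does water boil at sea level?",
        "understand": "Why does water boil at a lower temperature at high altitude?",
        "apply": "A recipe written for sea level is cooked at 3,000 m; how should the boiling step change?",
    },
    # broadening: questions about concepts critical to the seed objective
    "critical_concepts": {
        "atmospheric pressure": "How does atmospheric pressure affect the boiling point of a liquid?",
        "phase transition": "What happens to water molecules when water boils?",
    },
}
```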

Experiments demonstrate that **StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases**, thereby providing more reliable and consistent conclusions regarding model capabilities.

Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

# How to evaluate?

Our 🐱[repo](https://github.com/c-box/StructEval) provides easy-to-use scripts for both evaluating LLMs on existing StructEval benchmarks and generating new benchmarks based on the StructEval framework.
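
As a minimal illustrative sketch (not the repo's actual script; the file name, item fields and predict() interface below are assumptions), evaluating a model on a StructEval-style multiple-choice benchmark file amounts to a loop like this:

```python
# Minimal illustrative sketch, not the repo's actual evaluation script:
# the file name, item fields and predict() interface are assumptions.
import json

def evaluate(benchmark_path: str, predict) -> float:
    # predict(question, options) should return the index of the chosen option.
    with open(benchmark_path, encoding="utf-8") as f:
        items = json.load(f)  # assumed: a list of multiple-choice items
    correct = sum(
        int(predict(item["question"], item["options"]) == item["answer"])
        for item in items
    )
    return correct / len(items)

# Example usage with a trivial baseline that always picks the first option:
# print(evaluate("structeval_benchmark.json", lambda question, options: 0))
```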

# Contact

If you have any questions, feel free to reach out to us at [[email protected]](mailto:[email protected]).

"""


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"


CITATION_BUTTON_TEXT = r"""

coming soon

"""


ACKNOWLEDGEMENT_TEXT = """

Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

"""


NOTES_TEXT = """

* *Base* benchmarks refer to the original datasets, while *struct* benchmarks refer to the benchmarks constructed by StructEval using these base benchmarks as seed data.

* For most models, the base MMLU results are collected from their official technical reports. For models without reported results, we evaluate them with [opencompass](https://opencompass.org.cn/home).

* For the other two base benchmarks and all three structured benchmarks: chat models are evaluated under the 0-shot setting, and completion models are evaluated under the 0-shot setting with perplexity-based (ppl) option scoring. The prompt format is kept consistent across all benchmarks (see the ppl-scoring sketch below).
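
The sketch below is a simplified illustration of that ppl-based scoring scheme, not the actual evaluation code behind this leaderboard; the model name ("gpt2"), the prompt template and the example item are placeholders.

```python
# Simplified illustration of ppl-based option scoring for a completion model;
# this is NOT the evaluation code used for this leaderboard. The model name
# ("gpt2") and the prompt template are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_nll(question: str, option: str) -> float:
    # Per-token negative log-likelihood of the option, conditioned on the prompt
    # (assumes the prompt tokenization is a prefix of the full tokenization).
    prompt = f"Question: {question} Answer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # exclude prompt tokens from the loss
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss
    return loss.item()

question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter", "Saturn"]
prediction = min(options, key=lambda o: option_nll(question, o))
print(prediction)  # the option with the lowest loss, i.e. the lowest perplexity
```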

"""