
HEAD_TEXT = """
This is the official leaderboard for the 🏅StructEval benchmark. Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs.

Please refer to the 🐱[StructEval repository](https://github.com/c-box/StructEval) for model evaluation and 📖[our paper](https://arxiv.org/abs/2408.03281) for experimental analysis.

🚀 **_Latest News_**

* [2024.8.6] We released the first version of the StructEval leaderboard, which includes 22 open-source language models; more datasets and models are coming soon🔥🔥🔥.

* [2024.7.31] We regenerated the StructEval benchmark based on the latest [Wikipedia](https://www.wikipedia.org/) pages (20240601) using the [GPT-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) model, which helps minimize the impact of data contamination🔥🔥🔥.

"""


ABOUT_TEXT = """# What is StructEval?

Evaluation is the baton for the development of large language models.

Current evaluations typically employ *a single-item assessment paradigm* for each atomic test objective, which struggles to discern whether a model genuinely possesses the required capabilities or merely memorizes/guesses the answers to specific questions.

To this end, we propose a novel evaluation framework referred to as ***StructEval***.

Starting from an atomic test objective, StructEval deepens and broadens the evaluation by conducting a **structured assessment across multiple cognitive levels and critical concepts**, and therefore offers a comprehensive, robust and consistent evaluation for LLMs.
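
For illustration only, a structured item group grown from a single seed objective might look like the sketch below; the schema and field names are assumptions made for this example, not the released data format.

```python
# Purely illustrative sketch: the schema and field names are assumptions made
# for this example, not the released StructEval data format.
structured_item_group = {
    # the atomic test objective used as the seed
    "seed_objective": "Water boils at 100 °C at standard atmospheric pressure.",
    # deepening: questions that probe the same objective at several cognitive levels
    "cognitive_levels": {
        "remember": "At what temperature does water boil at sea level?",
        "understand": "Why does water boil at a lower temperature at high altitude?",
        "apply": "A recipe written for sea level is cooked at 3,000 m; how should the boiling step change?",
    },
    # broadening: questions about concepts critical to the seed objective
    "critical_concepts": {
        "atmospheric pressure": "How does atmospheric pressure affect the boiling point of a liquid?",
        "phase transition": "What happens to water molecules when water boils?",
    },
}
```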

Experiments demonstrate that **StructEval serves as a reliable tool for resisting the risk of data contamination and reducing the interference of potential biases**, thereby providing more reliable and consistent conclusions regarding model capabilities.

Our framework also sheds light on the design of future principled and trustworthy LLM evaluation protocols.

# How to evaluate?

Our 🐱[repo](https://github.com/c-box/StructEval) provides easy-to-use scripts for both evaluating LLMs on existing StructEval benchmarks and generating new benchmarks based on the StructEval framework.
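
As a minimal illustrative sketch (not the repo's actual script; the file name, item fields and predict() interface below are assumptions), evaluating a model on a StructEval-style multiple-choice benchmark file amounts to a loop like this:

```python
# Minimal illustrative sketch, not the repo's actual evaluation script:
# the file name, item fields and predict() interface are assumptions.
import json

def evaluate(benchmark_path: str, predict) -> float:
    # predict(question, options) should return the index of the chosen option.
    with open(benchmark_path, encoding="utf-8") as f:
        items = json.load(f)  # assumed: a list of multiple-choice items
    correct = sum(
        int(predict(item["question"], item["options"]) == item["answer"])
        for item in items
    )
    return correct / len(items)

# Example usage with a trivial baseline that always picks the first option:
# print(evaluate("structeval_benchmark.json", lambda question, options: 0))
```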

# Contact

If you have any questions, feel free to reach out to us at [[email protected]](mailto:[email protected]).

"""


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"


CITATION_BUTTON_TEXT = r"""

coming soon

"""


ACKNOWLEDGEMENT_TEXT = """

Inspired by the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

"""


NOTES_TEXT = """

* *Base* benchmarks refer to the original datasets, while *struct* benchmarks refer to the benchmarks constructed by StructEval using these base benchmarks as seed data.

* For most models, the base MMLU results are collected from their official technical reports. For models without reported results, we evaluate them with [opencompass](https://opencompass.org.cn/home).

* For the other two base benchmarks and all three structured benchmarks: chat models are evaluated under the 0-shot setting, and completion models are evaluated under the 0-shot setting with perplexity-based (ppl) option scoring. The prompt format is kept consistent across all benchmarks (see the ppl-scoring sketch below).
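
The sketch below is a simplified illustration of that ppl-based scoring scheme, not the actual evaluation code behind this leaderboard; the model name ("gpt2"), the prompt template and the example item are placeholders.

```python
# Simplified illustration of ppl-based option scoring for a completion model;
# this is NOT the evaluation code used for this leaderboard. The model name
# ("gpt2") and the prompt template are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_nll(question: str, option: str) -> float:
    # Per-token negative log-likelihood of the option, conditioned on the prompt
    # (assumes the prompt tokenization is a prefix of the full tokenization).
    prompt = f"Question: {question} Answer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # exclude prompt tokens from the loss
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss
    return loss.item()

question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter", "Saturn"]
prediction = min(options, key=lambda o: option_nll(question, o))
print(prediction)  # the option with the lowest loss, i.e. the lowest perplexity
```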

"""