👋 join us on Discord and WeChat
## 📣 OpenCompass 2023 LLM Annual Leaderboard

We are honored to have witnessed the tremendous progress of artificial general intelligence together with the community over the past year, and we are very pleased that **OpenCompass** has been able to help so many developers and users.

We are announcing the **OpenCompass 2023 LLM Annual Leaderboard** plan. We expect to release the annual leaderboard in January 2024, systematically evaluating LLM performance across capabilities such as language, knowledge, reasoning, creation, long-text, and agents. At that time, we will release rankings for both open-source models and commercial API models, aiming to provide a comprehensive, objective, and neutral reference for industry and the research community.

We sincerely invite large models of all kinds to join OpenCompass and showcase their strengths in different fields. We also welcome researchers and developers to offer suggestions and contributions that help advance the development of LLMs. If you have any questions or needs, please feel free to [contact us](mailto:opencompass@pjlab.org.cn). In addition, the relevant evaluation content, performance statistics, and evaluation methods will be open-sourced along with the leaderboard release.

More details on CompassBench 2023 are available in the [Doc](docs/zh_cn/advanced_guides/compassbench_intro.md).

Let's look forward to the release of the OpenCompass 2023 LLM Annual Leaderboard!

## 🧭 Welcome to **OpenCompass**!

Just like a compass guides us on our journey, OpenCompass will guide you through the complex landscape of evaluating large language models. With its powerful algorithms and intuitive interface, OpenCompass makes it easy to assess the quality and effectiveness of your NLP models.

🚩🚩🚩 Explore opportunities at OpenCompass! We're currently **hiring full-time researchers/engineers and interns**. If you're passionate about LLMs and OpenCompass, don't hesitate to reach out to us via [email](mailto:zhangsongyang@pjlab.org.cn). We'd love to hear from you!

🔥🔥🔥 We are delighted to announce that **OpenCompass has been recommended by Meta AI**. Click [Get Started](https://ai.meta.com/llama/get-started/#validation) of Llama for more information.

> **Attention**

| Language | Knowledge | Reasoning | Examination |
| --- | --- | --- | --- |
| **Word Definition:** WiC, SummEdits<br>**Idiom Learning:** CHID<br>**Semantic Similarity:** AFQMC, BUSTM<br>**Coreference Resolution:** CLUEWSC, WSC, WinoGrande<br>**Translation:** Flores, IWSLT2017<br>**Multi-language Question Answering:** TyDi-QA, XCOPA<br>**Multi-language Summary:** XLSum | **Knowledge Question Answering:** BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA | **Textual Entailment:** CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI<br>**Commonsense Reasoning:** StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA<br>**Mathematical Reasoning:** MATH, GSM8K<br>**Theorem Application:** TheoremQA, StrategyQA, SciBench<br>**Comprehensive Reasoning:** BBH | **Junior High, High School, University, Professional Examinations:** C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi<br>**Medical Examinations:** CMB |

| Understanding | Long Context | Safety | Code |
| --- | --- | --- | --- |
| **Reading Comprehension:** C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0<br>**Content Summary:** CSL, LCSTS, XSum, SummScreen<br>**Content Analysis:** EPRSTMT, LAMBADA, TNEWS | **Long Context Understanding:** LEval, LongBench, GovReports, NarrativeQA, Qasper | **Safety:** CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA<br>**Robustness:** AdvGLUE | **Code:** HumanEval, HumanEvalX, MBPP, APPs, DS1000 |

| Open-source Models | API Models |
| --- | --- |
| [InternLM](https://github.com/InternLM/InternLM)<br>[LLaMA](https://github.com/facebookresearch/llama)<br>[Vicuna](https://github.com/lm-sys/FastChat)<br>[Alpaca](https://github.com/tatsu-lab/stanford_alpaca)<br>[Baichuan](https://github.com/baichuan-inc)<br>[WizardLM](https://github.com/nlpxucan/WizardLM)<br>[ChatGLM2](https://github.com/THUDM/ChatGLM2-6B)<br>[ChatGLM3](https://github.com/THUDM/ChatGLM3-6B)<br>[TigerBot](https://github.com/TigerResearch/TigerBot)<br>[Qwen](https://github.com/QwenLM/Qwen)<br>[BlueLM](https://github.com/vivo-ai-lab/BlueLM)<br>... | OpenAI<br>Claude<br>ZhipuAI (ChatGLM)<br>Baichuan<br>ByteDance (YunQue)<br>Huawei (PanGu)<br>360<br>Baidu (ERNIEBot)<br>MiniMax (ABAB-Chat)<br>SenseTime (nova)<br>Xunfei (Spark)<br>... |