Leaderboards and benchmarks ✨

clefourrier 's Collections

LLM evaluation datasets

updated 1 day ago

Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ...

Upvote

Running on CPU Upgrade

11.8k

🏆

Open LLM Leaderboard 2

Track, rank and evaluate open LLMs and chatbots

Note The reference leaderboard for Open LLMs! Find the best LLM for your size and precision needs, compare your models to the others! (Evaluates on ARC, HellaSwag, TruthfulQA, and MMLU)
Running

995

📈

Big Code Models Leaderboard

Note Specialized leaderboard for models with coding capabilities 🖥️ (Evaluates on HumanEval and MultiPL-E)
Running

3.7k

🏆🤖

Chatbot Arena Leaderboard

Note Pitches chatbots against one another to compare their output quality (Evaluates on MTBench, an Elo score, and MMLU)
Running

380

🏆🏋️

LLM-Perf Leaderboard

Note Do you want to know which model to use for which hardware? This leaderboard is for you! (Looks at the throughput of many LLMs in different hardware settings)
EleutherAI: Going Beyond "Open Science" to "Science in the Open"

Paper • 2210.06413 • Published Oct 12, 2022

Note This paper introduces (among other things) the Eleuther AI Harness, a reference evaluation suite which is simple to use and quite complete!
Holistic Evaluation of Language Models

Paper • 2211.09110 • Published Nov 16, 2022 • 1

Note The HELM paper! A super cool reference paper on the many axes to look at when creating an LLM benchmark or evaluation suite. Super exhaustive and interesting to read.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Paper • 2206.04615 • Published Jun 9, 2022 • 5

Note The BigBench paper! A bunch of tasks to evaluate edge cases and random unusual LLM capabilities. The associated benchmark has since been completed with a lot of fun crowdsourced tasks.
Running on CPU Upgrade

4.2k

🥇

MTEB Leaderboard

Note Text Embeddings benchmark across 58 tasks and 112 languages!
Running on CPU Upgrade

208

🦾

GAIA Leaderboard

Note A leaderboard for tool augmented LLMs!
Running

84

🚀

OpenCompass LLM Leaderboard

Note An LLM leaderboard for Chinese models on many metric axes - super complete
Running on CPU Upgrade

483

📉

Open Ko-LLM Leaderboard

Note An Open LLM Leaderboard specially for Korean models by our friends at Upstage!
Running

35

🐨

Open Dutch LLM Leaderboard

Note An Open LLM Leaderboard specially for Dutch models!
Configuration error

49

⚡

Hallucination Evaluation Leaderboard

Note A leaderboard to evaluate the propensy of LLMs to hallucinate
Running on CPU Upgrade

125

🔥

Hallucinations Leaderboard

Note A lot of metrics if you are interested in the propensity of LLMs to hallucinate!
Running

87

🐠

Nexus Function Calling Leaderboard

Note Tests LLM API usage and calls (few models atm)
Running

53

📈

CyberSecEvalTest

Note How likely is your LLM to help cyber attacks?
Running

181

🌖

Yet Another LLM Leaderboard

Note An aggregation of benchmarks well correlated with human preferences
Running on CPU Upgrade

85

🥇

LLM Safety Leaderboard

Note Bias, safety, toxicity, all those things that are important to test when your chatbot actually interacts with users
Running

32

⚡

EvalCrafter

Note Text to video generation leaderboard
Running

418

🏆

Can Ai Code Results

Note Coding benchmark
Running

95

🏆

Ocrbench Leaderboard

Note An OCR benchmark
Running

53

🥇

NPHardEval Leaderboard

Note Dynamic leaderboard using complexity classes to create reasoning problems for LLMs - quite a cool one
Running

39

💻

Redteaming Resistance Leaderboard

Note Red teaming datasets success against models
Running

18

🏆

Subquadratic LLM Leaderboard

Note The Open LLM Leaderboard, but for structured state models!
Running

516

🖼💬

Vision Arena (Testing VLMs side-by-side)

Note A multimodal arena!
Running

127

📊

VBench Leaderboard
Running on CPU Upgrade

143

🏆

Open Portuguese LLM Leaderboard

Track, rank and evaluate open LLMs in Portuguese

Note An LLM leaderboard for Portuguese
Running on CPU Upgrade

62

🏆

Open Ita Llm Leaderboard

Track, rank and evaluate open LLMs in the italian language!

Note An LLM leaderboard for Italian
Running

7

🏆🇲🇾🤖

Malay LLM Leaderboard

Note An LLM leaderboard for Malay
Running on Zero

249

📈

GenAI Arena

Realtime Image/Video Gen AI Arena

Note An arena for image generation!
Running

8

📊

Q-Bench+ Leaderboard
Running on CPU Upgrade

32

📊

Parti Prompts Leaderboard
Running on CPU Upgrade

80

🥇

HHEM Leaderboard

Note An hallucination leaderboard, focused on a different set of tasks
Restarting on CPU Upgrade

52

🏆🇵🇱

Open PL LLM Leaderboard
Running on CPU Upgrade

85

🥇

OpenLLM Turkish leaderboard
Running

217

🦁

AI2 WildBench Leaderboard (V2)
Running on CPU Upgrade

533

🏆

Open ASR Leaderboard
Running on CPU Upgrade

501

🌎

Open VLM Leaderboard

VLMEvalKit Evaluation Results Collection
Running

276

📐

Reward Bench Leaderboard
Running on CPU Upgrade

504

🏆

TTS Arena

Vote on the latest TTS models!
Running

11

📝

Prompt Injection Detection Benchmark
Running

30

🏟️

Long Code Arena
Running

8

⚡

ML.ENERGY Leaderboard
Running

453

📢

UGI Leaderboard
Configuration error

79

🏃

Berkeley Function Calling Leaderboard
Running on CPU Upgrade

46

🥇

Open CoT Leaderboard

Track, rank and evaluate open LLMs' CoT quality
Running

22

🐑

URIAL Bench (Eval Base LLMs on MT-Bench)
Running

20

🔥

Indic Llm Leaderboard
Sleeping

8

🏆

Meta Open LLM Leaderboard
Running

9

👁

Science Leaderboard

Leaderboard for LLM for Science Reasoning
Running on CPU Upgrade

293

🥇

Open Medical-LLM Leaderboard
Running on CPU Upgrade

29

🥇

Open RL Leaderboard
Running

16

🥇

SeaExam Leaderboard
Running on CPU Upgrade

28

🥇

Hebrew LLM Leaderboard
Running on CPU Upgrade

143

🔬

Open LLM Progress Tracker
Running

146

🏆

Low-bit Quantized Open LLM Leaderboard

Track, rank and evaluate open LLMs and chatbots
Running on CPU Upgrade

60

🥇

AIR-Bench Leaderboard
Running on CPU Upgrade

111

🏆

Open Arabic LLM Leaderboard

Track, rank and evaluate open Arabic LLMs and chatbots
Running on CPU Upgrade

94

🏆

Open Chinese LLM Leaderboard
Running

190

🏢

3D Arena
Running

139

🥇

BigCodeBench Leaderboard
Running

16

🥇

Open Tw Llm Leaderboard
Running

76

🦓

Zebra Logic Bench
Running

65

🌍

European Leaderboard
Running

18

📊

🇨🇿 BenCzechMark
Running

34

🥇

Leaderboard
Running

28

🎭

Stick To Your Role! Leaderboard
Running

108

🏆

GPU Poor LLM Arena

Compact LLM Battle Arena: Frugal AI Face-Off!
Running on CPU Upgrade

58

🌸

La Leaderboard

Evaluate open LLMs in the languages of LATAM and Spain.
Running on CPU Upgrade

24

🥇

OpenLLM French leaderboard 🇫🇷
Running

13

🥇

GIFT Eval

GIFT-Eval: A Benchmark for General Time Series Forecasting
Running

46

💻

Judge Arena
Running

31

🏅

Persian LLM Leaderboard

Persian LLM Leaderboard
Running

34

🌖

Japanese Chatbot Arena Leaderboard
Running on CPU Upgrade

17

🌸

Open Japanese LLM Leaderboard

Upvote

Leaderboards and benchmarks ✨

Open LLM Leaderboard 2

Big Code Models Leaderboard

Chatbot Arena Leaderboard

LLM-Perf Leaderboard

MTEB Leaderboard

GAIA Leaderboard

OpenCompass LLM Leaderboard

Open Ko-LLM Leaderboard

Open Dutch LLM Leaderboard

Hallucination Evaluation Leaderboard

Hallucinations Leaderboard

Nexus Function Calling Leaderboard

CyberSecEvalTest

Yet Another LLM Leaderboard

LLM Safety Leaderboard

EvalCrafter

Can Ai Code Results

Ocrbench Leaderboard

NPHardEval Leaderboard

Redteaming Resistance Leaderboard

Subquadratic LLM Leaderboard

Vision Arena (Testing VLMs side-by-side)

VBench Leaderboard

Open Portuguese LLM Leaderboard

Open Ita Llm Leaderboard

Malay LLM Leaderboard

GenAI Arena

Q-Bench+ Leaderboard

Parti Prompts Leaderboard

HHEM Leaderboard

Open PL LLM Leaderboard

OpenLLM Turkish leaderboard

AI2 WildBench Leaderboard (V2)

Open ASR Leaderboard

Open VLM Leaderboard

Reward Bench Leaderboard

TTS Arena

Prompt Injection Detection Benchmark

Long Code Arena

ML.ENERGY Leaderboard

UGI Leaderboard

Berkeley Function Calling Leaderboard

Open CoT Leaderboard

URIAL Bench (Eval Base LLMs on MT-Bench)

Indic Llm Leaderboard

Meta Open LLM Leaderboard

Science Leaderboard

Open Medical-LLM Leaderboard

Open RL Leaderboard

SeaExam Leaderboard

Hebrew LLM Leaderboard

Open LLM Progress Tracker

Low-bit Quantized Open LLM Leaderboard

AIR-Bench Leaderboard

Open Arabic LLM Leaderboard

Open Chinese LLM Leaderboard

3D Arena

BigCodeBench Leaderboard

Open Tw Llm Leaderboard

Zebra Logic Bench

European Leaderboard

🇨🇿 BenCzechMark

Leaderboard

Stick To Your Role! Leaderboard

GPU Poor LLM Arena

La Leaderboard

OpenLLM French leaderboard 🇫🇷

GIFT Eval

Judge Arena

Persian LLM Leaderboard

Japanese Chatbot Arena Leaderboard

Open Japanese LLM Leaderboard