Spaces: Running on CPU Upgrade
Gregor Betz
committed
Commit • ad554f1
1 Parent(s): 992caee

src/display/about.py +5 -1
src/display/about.py
CHANGED
@@ -34,7 +34,7 @@ See the "About" tab for more details and motivation.
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
-LLM_BENCHMARKS_TEXT = """
+LLM_BENCHMARKS_TEXT = f"""
 ## How it works (roughly)
 
 To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and different CoT `regimes`. (A CoT `regime` consists in a prompt chain and decoding parameters used to generate a reasoning trace.)
@@ -53,6 +53,10 @@ Performance leaderboards like the [🤗 Open LLM Leaderboard](https://huggingfac
 
 Unlike these leaderboards, the `/\/` Open CoT Leaderboard assess a model's ability to effectively reason about a `task`:
 
+| Leaderboard | Measures | Metric | Focus |
+|:---|:---|:---|:---|
+| 🤗 Open LLM Leaderboard | Task performance | Absolute accuracy | Task performance |
+
 ### 🤗 Open LLM Leaderboard
 * Can `model` solve `task`?
 * Measures `task` performance.
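The only code change in this commit is the `f` prefix on `LLM_BENCHMARKS_TEXT`. The shown portion of the literal contains no `{...}` placeholders, so the hunk is behaviorally identical, but the prefix lets the template interpolate values at import time. A minimal sketch of the mechanism, with a hypothetical variable name not taken from the leaderboard's code:

```python
# Hypothetical illustration: any {name} inside an f-string literal is
# evaluated when the module is imported.
LEADERBOARD_NAME = "Open CoT Leaderboard"  # hypothetical value

LLM_BENCHMARKS_TEXT = f"""
## How it works (roughly)

Results shown here are produced by the {LEADERBOARD_NAME}.
"""

print("Open CoT Leaderboard" in LLM_BENCHMARKS_TEXT)
```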