Gregor Betz committed • Commit c2ba07b • 1 Parent(s): 44ef4de
description

Files changed (1): src/display/about.py (+7, -9)
src/display/about.py CHANGED

@@ -28,9 +28,7 @@ TITLE = """<h1 align="center" id="space-title"><code>/\/</code> Open CoT
 INTRODUCTION_TEXT = """
 The `/\/` Open CoT Leaderboard tracks the reasoning skills of LLMs, measured as their ability to generate **effective chain-of-thought reasoning traces**.
 
-The leaderboard reports **accuracy gains** achieved by using CoT, i.e.
-
-> _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_.
+The leaderboard reports **accuracy gains** achieved by using CoT, i.e.: _accuracy gain Δ_ = _CoT accuracy_ – _baseline accuracy_.
 
 See the "About" tab for more details and motivation.
 """
@@ -39,14 +37,14 @@ See the "About" tab for more details and motivation.
 LLM_BENCHMARKS_TEXT = """
 ## How it works
 
-
+To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and different CoT `regimes`. (A CoT `regime` consists in a prompt chain and decoding parameters used to generate a reasoning trace.)
 
-1.
-2. Let the model answer the test dataset problems and record the resulting _baseline accuracy_.
-3. Let the model answer the test dataset problems _with the reasoning traces appended_ to the prompt and record the resulting _CoT accuracy_.
-4. Compute the _accuracy gain Δ_ =
+1. Let the `model` generate CoT reasoning traces for all problems in the test dataset according to `regime`.
+2. Let the `model` answer the test dataset problems, and record the resulting _baseline accuracy_.
+3. Let the `model` answer the test dataset problems _with the reasoning traces appended_ to the prompt, and record the resulting _CoT accuracy_.
+4. Compute the _accuracy gain Δ_ = _CoT accuracy_ – _baseline accuracy_ for the given `model`, `task`, and `regime`.
 
-Each regime has a different accuracy gain Δ, and the leaderboard reports the best Δ achieved by
+Each `regime` has a different accuracy gain Δ, and the leaderboard reports the best Δ achieved by any regime.
 
 
 ## How is it different from other leaderboards?
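The scoring scheme spelled out in the revised "How it works" text is compact enough to state as code. Below is a minimal Python sketch, not part of this commit: `model.generate_cot`, `model.answer`, and the `task`/`regime` attributes are hypothetical stand-ins for whatever the actual evaluation harness uses.

```python
# Minimal sketch of the scoring scheme described in LLM_BENCHMARKS_TEXT.
# All model/task/regime interfaces here are hypothetical stand-ins.

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of test-dataset problems answered correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def accuracy_gain(model, task, regime) -> float:
    """Δ = CoT accuracy - baseline accuracy for one model, task, and regime."""
    # Step 1: generate a CoT reasoning trace per problem under this regime.
    traces = [model.generate_cot(problem, regime) for problem in task.problems]
    # Step 2: answer without traces and record the baseline accuracy.
    baseline = accuracy([model.answer(p) for p in task.problems], task.gold)
    # Step 3: answer with the traces appended to the prompt (CoT accuracy).
    cot = accuracy(
        [model.answer(p, trace=t) for p, t in zip(task.problems, traces)],
        task.gold,
    )
    # Step 4: the accuracy gain Δ.
    return cot - baseline

def best_gain(model, task, regimes) -> float:
    """The leaderboard reports the best Δ achieved by any regime."""
    return max(accuracy_gain(model, task, regime) for regime in regimes)
```

For concreteness: a model with a _baseline accuracy_ of 0.62 on a task and a _CoT accuracy_ of 0.71 under some regime earns Δ = 0.09 there; taking the maximum over regimes credits the model with its most effective prompt chain and decoding parameters.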