Commit 64e7ccb
Jae-Won Chung committed
1 parent: 48843fe

Better About tab, fetch leaderboard date from git
Browse files:
- LEADERBOARD.md +18 -0
- app.py +14 -8
- data/{2023-07-05/A100_chat-concise_benchmark.csv → A100_chat-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A100_chat_benchmark.csv → A100_chat_benchmark.csv} +0 -0
- data/{2023-07-05/A100_instruct-concise_benchmark.csv → A100_instruct-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A100_instruct_benchmark.csv → A100_instruct_benchmark.csv} +0 -0
- data/{2023-07-05/A40_chat-concise_benchmark.csv → A40_chat-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A40_chat_benchmark.csv → A40_chat_benchmark.csv} +0 -0
- data/{2023-07-05/A40_instruct-concise_benchmark.csv → A40_instruct-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A40_instruct_benchmark.csv → A40_instruct_benchmark.csv} +0 -0
- data/{2023-07-05/V100_chat-concise_benchmark.csv → V100_chat-concise_benchmark.csv} +0 -0
- data/{2023-07-05/V100_chat_benchmark.csv → V100_chat_benchmark.csv} +0 -0
- data/{2023-07-05/V100_instruct-concise_benchmark.csv → V100_instruct-concise_benchmark.csv} +0 -0
- data/{2023-07-05/V100_instruct_benchmark.csv → V100_instruct_benchmark.csv} +0 -0
- data/{2023-07-05/models.json → models.json} +0 -0
- data/{2023-07-05/schema.yaml → schema.yaml} +0 -0
- data/{2023-07-05/score.csv → score.csv} +0 -0
LEADERBOARD.md
CHANGED

@@ -65,14 +65,32 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
 See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
 
+## Contributing
+
+Any kind of contribution is more than welcome!
+Please look around our [repository](https://github.com/ml-energy/leaderboard).
+
+Especially, if you want to see a specific model on the leaderboard, please consider adding support for the model.
+We'll consider running those on the hardware we have.
+First, see if the model is available on the Hugging Face Hub and compatible with lm-evaluation-harness.
+Then, in our [`benchmark.py`](https://github.com/ml-energy/leaderboard/blob/master/scripts/benchmark.py), implement a way to load the weights of the model and run generative inference.
+
+Currently, we use FastChat to load models and run inference, but we'll eventually abstract the model executor, making it easier to add models that FastChat does not support.
+
 ## Limitations
 
 Currently, inference is run with basically bare PyTorch with batch size 1, which is unrealistic assuming a production serving scenario.
 Hence, absolute latency, throughput, and energy numbers should not be used to estimate figures in real production settings, while relative comparison makes some sense.
 
+Batch size 1, in some sense, is the lowest possible hardware utilization.
+We'll soon benchmark batch sizes larger than 1 without continuous batching for comparison.
+This would show what happens in the case of very high hardware utilization (at least with PyTorch), assuming an ideal case where all sequences in each batch generate the same number of output tokens.
+By doing this, we can provide numbers for reasonable comparison without being tied to any existing generative model serving system.
+
 ## Upcoming
 
 - Within the Summer, we'll add an online text generation interface for real time energy consumption measurement!
+- Batched inference
 - More optimized inference runtimes, like TensorRT.
 - Larger models with distributed inference, like Falcon 40B.
 - More models, like RWKV.
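The Contributing section added above asks contributors to implement model loading and generative inference in `benchmark.py`. For orientation only, here is a rough sketch of loading a model through FastChat and running one generation step; this is not the actual `benchmark.py` code, and the model path and generation settings are illustrative placeholders.

```python
# Sketch only: FastChat-based loading plus one generative inference step.
# NOT the leaderboard's benchmark.py; model path and generation parameters
# below are illustrative placeholders.
import torch
from fastchat.model import load_model, get_conversation_template

model_path = "lmsys/vicuna-7b-v1.3"  # placeholder model
model, tokenizer = load_model(model_path, device="cuda", num_gpus=1)

# Build the model-specific chat prompt with FastChat's conversation template.
conv = get_conversation_template(model_path)
conv.append_message(conv.roles[0], "Explain the difference between a CPU and a GPU.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
with torch.inference_mode():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True))
```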
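The Limitations text above also previews benchmarking batch sizes larger than 1 without continuous batching. Below is a minimal sketch of that setup with plain `transformers` (again, not the leaderboard's code; the model name, prompts, and token counts are placeholders). Pinning `min_new_tokens` to `max_new_tokens` approximates the ideal case in which every sequence in the batch generates the same number of output tokens.

```python
# Sketch only: static batching (batch size > 1, no continuous batching).
# Model name, prompts, and token counts are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder model
# Left padding so that generation starts right after each prompt.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompts = ["Explain GPUs.", "What is energy?", "Define latency.", "Why batch requests?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

# Force every sequence to generate exactly the same number of new tokens,
# matching the ideal case described in the Limitations section.
outputs = model.generate(
    **inputs,
    min_new_tokens=128,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```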
app.py
CHANGED

@@ -1,10 +1,11 @@
 from __future__ import annotations
 
-import os
 import json
 import yaml
+import subprocess
 import itertools
 import contextlib
+from dateutil import parser
 
 import numpy as np
 import gradio as gr
@@ -158,7 +159,7 @@ class TableManager:
             gr.Dropdown.update(choices=["None", *columns]),
         ]
 
-    def set_filter_get_df(self, *filters):
+    def set_filter_get_df(self, *filters) -> pd.DataFrame:
        """Set the current set of filters and return the filtered DataFrame."""
         # If the filter is empty, we default to the first choice for each key.
         if not filters:
@@ -200,16 +201,21 @@ class TableManager:
     return fig, width, height, ""
 
 
-# Find the latest version of the CSV files in data/
-# and initialize the global TableManager.
-latest_date = sorted(os.listdir("data/"))[-1]
-
 # The global instance of the TableManager should only be used when
 # initializing components in the Gradio interface. If the global instance
 # is mutated while handling user sessions, the change will be reflected
 # in every user session. Instead, the instance provided by gr.State should
 # be used.
-global_tbm = TableManager(
+global_tbm = TableManager("data")
+
+# Run git log to get the latest commit date.
+proc = subprocess.run(
+    ["git", "log", "-1", "--format=%cd"],
+    stdout=subprocess.PIPE,
+    stderr=subprocess.PIPE,
+    encoding="utf-8",
+)
+current_date = parser.parse(proc.stdout.strip()).strftime("%Y-%m-%d")
 
 # Custom JS.
 # XXX: This is a hack to make the model names clickable.
@@ -397,7 +403,7 @@ with block:
 
     # Block 5: Leaderboard date.
     with gr.Row():
-        gr.HTML(f"<h3 style='color: gray'>Date: {
+        gr.HTML(f"<h3 style='color: gray'>Date: {current_date}</h3>")
 
     # Tab 2: About page.
     with gr.TabItem("About"):
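One note on the date-fetching change above: `git log -1 --format=%cd` prints the committer date of the latest commit in git's default date format, which `dateutil` can parse without an explicit format string. A standalone check (the sample string is illustrative; assumes `python-dateutil` is installed):

```python
from dateutil import parser

# Git's default %cd output looks like "Wed Jul 5 10:12:48 2023 -0400"
# (illustrative sample). dateutil infers the format automatically.
sample = "Wed Jul 5 10:12:48 2023 -0400"
print(parser.parse(sample).strftime("%Y-%m-%d"))  # prints 2023-07-05
```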
data/2023-07-05/* → data/* (15 files)
RENAMED
File contents unchanged; the full list of renamed files appears in the summary above.