Commit 64e7ccb
Jae-Won Chung committed
1 parent: 48843fe

Better About tab, fetch leaderboard date from git
Browse files:
- LEADERBOARD.md +18 -0
- app.py +14 -8
- data/{2023-07-05/A100_chat-concise_benchmark.csv → A100_chat-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A100_chat_benchmark.csv → A100_chat_benchmark.csv} +0 -0
- data/{2023-07-05/A100_instruct-concise_benchmark.csv → A100_instruct-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A100_instruct_benchmark.csv → A100_instruct_benchmark.csv} +0 -0
- data/{2023-07-05/A40_chat-concise_benchmark.csv → A40_chat-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A40_chat_benchmark.csv → A40_chat_benchmark.csv} +0 -0
- data/{2023-07-05/A40_instruct-concise_benchmark.csv → A40_instruct-concise_benchmark.csv} +0 -0
- data/{2023-07-05/A40_instruct_benchmark.csv → A40_instruct_benchmark.csv} +0 -0
- data/{2023-07-05/V100_chat-concise_benchmark.csv → V100_chat-concise_benchmark.csv} +0 -0
- data/{2023-07-05/V100_chat_benchmark.csv → V100_chat_benchmark.csv} +0 -0
- data/{2023-07-05/V100_instruct-concise_benchmark.csv → V100_instruct-concise_benchmark.csv} +0 -0
- data/{2023-07-05/V100_instruct_benchmark.csv → V100_instruct_benchmark.csv} +0 -0
- data/{2023-07-05/models.json → models.json} +0 -0
- data/{2023-07-05/schema.yaml → schema.yaml} +0 -0
- data/{2023-07-05/score.csv → score.csv} +0 -0
LEADERBOARD.md
CHANGED

@@ -65,14 +65,32 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
 See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
 
+## Contributing
+
+Any kind of contribution is more than welcome!
+Please look around our [repository](https://github.com/ml-energy/leaderboard).
+
+Especially, if you want to see a specific model on the leaderboard, please consider adding support for the model.
+We'll consider running those on the hardware we have.
+First, see if the model is available on the Hugging Face Hub and compatible with lm-evaluation-harness.
+Then, in our [`benchmark.py`](https://github.com/ml-energy/leaderboard/blob/master/scripts/benchmark.py), implement a way to load the weights of the model and run generative inference.
+
+Currently, we use FastChat to load models and run inference, but we'll eventually abstract the model executor, making it easier to add models that FastChat does not support.
+
 ## Limitations
 
 Currently, inference is run with basically bare PyTorch with batch size 1, which is unrealistic assuming a production serving scenario.
 Hence, absolute latency, throughput, and energy numbers should not be used to estimate figures in real production settings, while relative comparison makes some sense.
 
+Batch size 1, in some sense, is the lowest possible hardware utilization.
+We'll soon benchmark batch sizes larger than 1 without continuous batching for comparison.
+This would show what happens in the case of very high hardware utilization (at least with PyTorch), assuming an ideal case where all sequences in each batch generate the same number of output tokens.
+By doing this, we can provide numbers for reasonable comparison without being tied to any existing generative model serving system.
+
 ## Upcoming
 
 - Within the Summer, we'll add an online text generation interface for real time energy consumption measurement!
+- Batched inference
 - More optimized inference runtimes, like TensorRT.
 - Larger models with distributed inference, like Falcon 40B.
 - More models, like RWKV.
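The Contributing section added above asks contributors to implement model loading and generative inference in `benchmark.py`. For orientation only, here is a rough sketch of loading a model through FastChat and running one generation step; this is not the actual `benchmark.py` code, and the model path and generation settings are illustrative placeholders.

```python
# Sketch only: FastChat-based loading plus one generative inference step.
# NOT the leaderboard's benchmark.py; model path and generation parameters
# below are illustrative placeholders.
import torch
from fastchat.model import load_model, get_conversation_template

model_path = "lmsys/vicuna-7b-v1.3"  # placeholder model
model, tokenizer = load_model(model_path, device="cuda", num_gpus=1)

# Build the model-specific chat prompt with FastChat's conversation template.
conv = get_conversation_template(model_path)
conv.append_message(conv.roles[0], "Explain the difference between a CPU and a GPU.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
with torch.inference_mode():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True))
```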
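The Limitations text above also previews benchmarking batch sizes larger than 1 without continuous batching. Below is a minimal sketch of that setup with plain `transformers` (again, not the leaderboard's code; the model name, prompts, and token counts are placeholders). Pinning `min_new_tokens` to `max_new_tokens` approximates the ideal case in which every sequence in the batch generates the same number of output tokens.

```python
# Sketch only: static batching (batch size > 1, no continuous batching).
# Model name, prompts, and token counts are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder model
# Left padding so that generation starts right after each prompt.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompts = ["Explain GPUs.", "What is energy?", "Define latency.", "Why batch requests?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

# Force every sequence to generate exactly the same number of new tokens,
# matching the ideal case described in the Limitations section.
outputs = model.generate(
    **inputs,
    min_new_tokens=128,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```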
app.py
CHANGED

@@ -1,10 +1,11 @@
 from __future__ import annotations
 
-import os
 import json
 import yaml
+import subprocess
 import itertools
 import contextlib
+from dateutil import parser
 
 import numpy as np
 import gradio as gr
@@ -158,7 +159,7 @@ class TableManager:
             gr.Dropdown.update(choices=["None", *columns]),
         ]
 
-    def set_filter_get_df(self, *filters):
+    def set_filter_get_df(self, *filters) -> pd.DataFrame:
        """Set the current set of filters and return the filtered DataFrame."""
         # If the filter is empty, we default to the first choice for each key.
         if not filters:
@@ -200,16 +201,21 @@ class TableManager:
     return fig, width, height, ""
 
 
-# Find the latest version of the CSV files in data/
-# and initialize the global TableManager.
-latest_date = sorted(os.listdir("data/"))[-1]
-
 # The global instance of the TableManager should only be used when
 # initializing components in the Gradio interface. If the global instance
 # is mutated while handling user sessions, the change will be reflected
 # in every user session. Instead, the instance provided by gr.State should
 # be used.
-global_tbm = TableManager(
+global_tbm = TableManager("data")
+
+# Run git log to get the latest commit date.
+proc = subprocess.run(
+    ["git", "log", "-1", "--format=%cd"],
+    stdout=subprocess.PIPE,
+    stderr=subprocess.PIPE,
+    encoding="utf-8",
+)
+current_date = parser.parse(proc.stdout.strip()).strftime("%Y-%m-%d")
 
 # Custom JS.
 # XXX: This is a hack to make the model names clickable.
@@ -397,7 +403,7 @@ with block:
 
     # Block 5: Leaderboard date.
     with gr.Row():
-        gr.HTML(f"<h3 style='color: gray'>Date: {
+        gr.HTML(f"<h3 style='color: gray'>Date: {current_date}</h3>")
 
     # Tab 2: About page.
     with gr.TabItem("About"):
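One note on the date-fetching change above: `git log -1 --format=%cd` prints the committer date of the latest commit in git's default date format, which `dateutil` can parse without an explicit format string. A standalone check (the sample string is illustrative; assumes `python-dateutil` is installed):

```python
from dateutil import parser

# Git's default %cd output looks like "Wed Jul 5 10:12:48 2023 -0400"
# (illustrative sample). dateutil infers the format automatically.
sample = "Wed Jul 5 10:12:48 2023 -0400"
print(parser.parse(sample).strftime("%Y-%m-%d"))  # prints 2023-07-05
```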
data/2023-07-05/* → data/* (15 files)
RENAMED
File contents unchanged; the full list of renamed files appears in the summary above.