feat-improve-submission-page-0517 (#10)
feat: update the submission and about page (019ad238188be6cb31f63db507eef05c83a55857)
src/about.py CHANGED (+11 -123)
@@ -4,137 +4,25 @@ TITLE = """<h1 align="center" id="space-title">AIR-Bench: Automated Heterogeneou

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
-Check more information at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench)
"""

# Which evaluations are you running? how can people reproduce what you have?
BENCHMARKS_TEXT = f"""
-## How

-
-
-
-EVALUATION_QUEUE_TEXT = """
-## Steps to submit to AIR-Bench
-
-1. Install AIR-Bench
-```bash
-pip install air-benchmark
-```
-2. Run the evaluation script
-```bash
-cd AIR-Bench/scripts
-# Run all tasks
-python run_air_benchmark.py \\
-    --output_dir ./search_results \\
-    --encoder BAAI/bge-m3 \\
-    --reranker BAAI/bge-reranker-v2-m3 \\
-    --search_top_k 1000 \\
-    --rerank_top_k 100 \\
-    --max_query_length 512 \\
-    --max_passage_length 512 \\
-    --batch_size 512 \\
-    --pooling_method cls \\
-    --normalize_embeddings True \\
-    --use_fp16 True \\
-    --add_instruction False \\
-    --overwrite False
-
-# Run the tasks in the specified task type
-python run_air_benchmark.py \\
-    --task_types long-doc \\
-    --output_dir ./search_results \\
-    --encoder BAAI/bge-m3 \\
-    --reranker BAAI/bge-reranker-v2-m3 \\
-    --search_top_k 1000 \\
-    --rerank_top_k 100 \\
-    --max_query_length 512 \\
-    --max_passage_length 512 \\
-    --batch_size 512 \\
-    --pooling_method cls \\
-    --normalize_embeddings True \\
-    --use_fp16 True \\
-    --add_instruction False \\
-    --overwrite False

-# Run the tasks in the specified task type and domains
-python run_air_benchmark.py \\
-    --task_types long-doc \\
-    --domains arxiv book \\
-    --output_dir ./search_results \\
-    --encoder BAAI/bge-m3 \\
-    --reranker BAAI/bge-reranker-v2-m3 \\
-    --search_top_k 1000 \\
-    --rerank_top_k 100 \\
-    --max_query_length 512 \\
-    --max_passage_length 512 \\
-    --batch_size 512 \\
-    --pooling_method cls \\
-    --normalize_embeddings True \\
-    --use_fp16 True \\
-    --add_instruction False \\
-    --overwrite False

-# Run the tasks in the specified languages
-python run_air_benchmark.py \\
-    --languages en \\
-    --output_dir ./search_results \\
-    --encoder BAAI/bge-m3 \\
-    --reranker BAAI/bge-reranker-v2-m3 \\
-    --search_top_k 1000 \\
-    --rerank_top_k 100 \\
-    --max_query_length 512 \\
-    --max_passage_length 512 \\
-    --batch_size 512 \\
-    --pooling_method cls \\
-    --normalize_embeddings True \\
-    --use_fp16 True \\
-    --add_instruction False \\
-    --overwrite False
-
-# Run the tasks in the specified task type, domains, and languages
-python run_air_benchmark.py \\
-    --task_types qa \\
-    --domains wiki web \\
-    --languages en \\
-    --output_dir ./search_results \\
-    --encoder BAAI/bge-m3 \\
-    --reranker BAAI/bge-reranker-v2-m3 \\
-    --search_top_k 1000 \\
-    --rerank_top_k 100 \\
-    --max_query_length 512 \\
-    --max_passage_length 512 \\
-    --batch_size 512 \\
-    --pooling_method cls \\
-    --normalize_embeddings True \\
-    --use_fp16 True \\
-    --add_instruction False \\
-    --overwrite False
-```
-3. Package the search results.
-```bash
-# Zip "Embedding Model + NoReranker" search results in "<search_results>/<model_name>/NoReranker" to "<save_dir>/<model_name>_NoReranker.zip".
-python zip_results.py \\
-    --results_dir search_results \\
-    --model_name bge-m3 \\
-    --save_dir search_results/zipped_results
-
-# Zip "Embedding Model + Reranker" search results in "<search_results>/<model_name>/<reranker_name>" to "<save_dir>/<model_name>_<reranker_name>.zip".
-python zip_results.py \\
-    --results_dir search_results \\
-    --model_name bge-m3 \\
-    --reranker_name bge-reranker-v2-m3 \\
-    --save_dir search_results/zipped_results
-```
-4. Upload the `.zip` file on this page and fill in the model information:
-    - Model Name: such as `bge-m3`.
-    - Model URL: such as `https://huggingface.co/BAAI/bge-m3`.
-    - Reranker Name: such as `bge-reranker-v2-m3`. Keep empty for `NoReranker`.
-    - Reranker URL: such as `https://huggingface.co/BAAI/bge-reranker-v2-m3`. Keep empty for `NoReranker`.
-
-If you want to stay anonymous, you can only fill in the Model Name and Reranker Name (keep empty for `NoReranker`), and check the selection box below before submission.

-
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
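The removed EVALUATION_QUEUE_TEXT above packages search results with `scripts/zip_results.py`, zipping `<search_results>/<model_name>/<reranker_name>` into `<save_dir>/<model_name>_<reranker_name>.zip`. A minimal standard-library sketch of that packaging step, for readers who want to reproduce the layout without the helper script (the function below is hypothetical and not part of AIR-Bench):

```python
# Sketch only: mirrors the zip layout described in the removed submission guide,
# i.e. "<search_results>/<model_name>/<reranker_name>" -> "<save_dir>/<model_name>_<reranker_name>.zip".
# This is not the actual zip_results.py script from AIR-Bench/scripts.
import shutil
from pathlib import Path


def zip_search_results(results_dir: str, model_name: str, reranker_name: str, save_dir: str) -> str:
    """Zip one model/reranker result folder and return the path of the created archive."""
    src = Path(results_dir) / model_name / reranker_name
    save_path = Path(save_dir)
    save_path.mkdir(parents=True, exist_ok=True)
    archive_base = save_path / f"{model_name}_{reranker_name}"
    # shutil.make_archive adds the ".zip" suffix itself and returns the final path.
    return shutil.make_archive(str(archive_base), "zip", root_dir=src)


# Example matching the removed docs: "Embedding Model + NoReranker" results for bge-m3.
# zip_search_results("search_results", "bge-m3", "NoReranker", "search_results/zipped_results")
```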

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
+## Find more information at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench)
"""

# Which evaluations are you running? how can people reproduce what you have?
BENCHMARKS_TEXT = f"""
+## How is the test data generated?
+### Find more information at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/data_generation.md)

+## FAQ
+- Q: Will you release a new version of the datasets regularly? How often will AIR-Bench release a new version?
+- A: Yes, we plan to release new datasets on a regular basis. However, the update frequency is still to be decided.

+- Q: As you use models for quality control when generating the data, are the results biased toward the models that are used?
+- A: Yes, the results are biased toward the chosen models. However, we believe that datasets labeled by humans are also biased toward the annotators' preferences. The key question is whether the models' bias is consistent with humans'. To verify this, we used our approach to generate test data from the well-established MS MARCO dataset, benchmarked different models on both the generated dataset and the human-labeled DEV dataset, and compared the resulting model rankings. The Spearman correlation between the two rankings is 0.8211 (p-value = 5e-5), indicating that the models' preference is well aligned with the human one. Please refer to [here](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/available_evaluation_results.md#consistency-with-ms-marco) for details.

+"""

+EVALUATION_QUEUE_TEXT = """
+## Check out the submission steps at [our GitHub repo](https://github.com/AIR-Bench/AIR-Bench/blob/main/docs/submit_to_leaderboard.md)
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
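For context on how this module is consumed: leaderboard Spaces built from the Hugging Face template typically render these strings as Markdown in a Gradio app. A minimal sketch under that assumption (the tab layout below is illustrative, not this Space's actual app.py):

```python
# Sketch only: how about.py strings are typically rendered in a leaderboard Space.
# The layout below is an assumption for illustration, not this Space's actual app.py.
import gradio as gr

from src.about import BENCHMARKS_TEXT, EVALUATION_QUEUE_TEXT, INTRODUCTION_TEXT, TITLE

with gr.Blocks() as demo:
    gr.HTML(TITLE)                      # the HTML title string defined in about.py
    gr.Markdown(INTRODUCTION_TEXT)      # shown above the leaderboard table
    with gr.Tab("About"):
        gr.Markdown(BENCHMARKS_TEXT)    # data generation notes and FAQ
    with gr.Tab("Submit here"):
        gr.Markdown(EVALUATION_QUEUE_TEXT)  # submission instructions

if __name__ == "__main__":
    demo.launch()
```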