chiliu committed b83fa45 (parent: c897e70): fix
README.md CHANGED
@@ -175,30 +175,30 @@ We evaluated OpenLLaMA on a wide range of tasks using [lm-evaluation-harness](ht
The original LLaMA model was trained for 1 trillion tokens and GPT-J was trained for 500 billion tokens. We present the results in the table below. OpenLLaMA exhibits comparable performance to the original LLaMA and GPT-J across a majority of tasks, and outperforms them in some tasks.

-| **Task/Metric** |
-| ---------------------- | -------- |
-| anli_r1/acc | **0.35**
-| anli_r2/acc | 0.33
-| anli_r3/acc | 0.35 | 0.
-| arc_challenge/acc | 0.35
-| arc_challenge/acc_norm | 0.37 | 0.
-| arc_easy/acc | 0.71
-| arc_easy/acc_norm | 0.65 | 0.
-| boolq/acc | **0.72**
-| hellaswag/acc | 0.49
-| hellaswag/acc_norm | 0.66 | 0.
-| openbookqa/acc | 0.26 | 0.
-| openbookqa/acc_norm | 0.40 | 0.
-| piqa/acc | 0.76
-| piqa/acc_norm | 0.76 | 0.
-| record/em | 0.88 | 0.
-| record/f1 | 0.88 | 0.
-| rte/acc | 0.55 | 0.
-| truthfulqa_mc/mc1 | **0.27**
-| truthfulqa_mc/mc2 | **0.37**
-| wic/acc | 0.49
-| winogrande/acc | 0.63
-| Average | 0.53
+| **Task/Metric** | finetuned-GPT 3B | OpenLLaMA 3B |
+| ---------------------- | -------- | ------------ |
+| anli_r1/acc | **0.35** | 0.33 |
+| anli_r2/acc | **0.33** | 0.32 |
+| anli_r3/acc | 0.35 | 0.35 |
+| arc_challenge/acc | **0.35** | 0.34 |
+| arc_challenge/acc_norm | 0.37 | 0.37 |
+| arc_easy/acc | **0.71** | 0.69 |
+| arc_easy/acc_norm | 0.65 | 0.65 |
+| boolq/acc | **0.72** | 0.66 |
+| hellaswag/acc | **0.49** | 0.43 |
+| hellaswag/acc_norm | 0.66 | 0.67 |
+| openbookqa/acc | 0.26 | 0.27 |
+| openbookqa/acc_norm | 0.40 | 0.40 |
+| piqa/acc | **0.76** | 0.75 |
+| piqa/acc_norm | 0.76 | 0.76 |
+| record/em | 0.88 | 0.88 |
+| record/f1 | 0.88 | 0.89 |
+| rte/acc | 0.55 | 0.58 |
+| truthfulqa_mc/mc1 | **0.27** | 0.22 |
+| truthfulqa_mc/mc2 | **0.37** | 0.35 |
+| wic/acc | **0.49** | 0.48 |
+| winogrande/acc | **0.63** | 0.62 |
+| Average | **0.53** | 0.52 |

We removed the tasks CB and WSC from our benchmark, as our model performs suspiciously well on these two tasks. We hypothesize that there could be benchmark data contamination in the training set.
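For readers who want to reproduce numbers in the format of the table above, the evaluation is based on EleutherAI's lm-evaluation-harness. The snippet below is a minimal sketch, not the authors' exact command: the `hf-causal` backend name, the `openlm-research/open_llama_3b` checkpoint, the task list, the zero-shot setting, and the batch size are illustrative assumptions, and task names and the `simple_evaluate` signature differ between harness versions.

```python
# Minimal sketch of an lm-evaluation-harness run (assumed v0.3-style API).
# The checkpoint, backend name, task list, and batch size are illustrative
# assumptions, not the exact configuration used for the table above.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # HuggingFace causal-LM backend
    model_args="pretrained=openlm-research/open_llama_3b",
    tasks=[
        "anli_r1", "anli_r2", "anli_r3",
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "record", "rte",
        "truthfulqa_mc", "wic", "winogrande",
    ],
    num_fewshot=0,
    batch_size=8,
)

# results["results"] maps each task to its metrics, e.g.
# results["results"]["boolq"]["acc"] or results["results"]["record"]["f1"].
# The "Average" row of the table is a plain mean over the reported
# per-task metric values (stderr entries excluded).
per_task = results["results"]
values = [v for metrics in per_task.values() for k, v in metrics.items()
          if not k.endswith("_stderr")]
print("average over reported metrics:", sum(values) / len(values))
```

Older harness releases expose the same functionality through a `main.py` command-line entry point, so the Python call above is only one way to produce per-task metrics in the format shown in the table.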