alexmarques committed on
Commit b4793e6
1 Parent(s): fc3ee56

Update README.md

Files changed (1)
  1. README.md +187 -30
README.md CHANGED
@@ -32,8 +32,9 @@ base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

- Quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
- It achieves an average score of 84.16 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 84.40.

### Model Optimizations

@@ -138,50 +139,73 @@ oneshot(
## Evaluation

- The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
- Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
- This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).

### Accuracy

- #### Open LLM Leaderboard evaluation scores
<table>
<tr>
<td><strong>Benchmark</strong>
</td>
<td><strong>Meta-Llama-3.1-70B-Instruct </strong>
</td>
- <td><strong>Meta-Llama-3.1-70B-Instruct-FP8(this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>
</tr>
<tr>
<td>MMLU (5-shot)
</td>
- <td>83.83
</td>
- <td>83.75
</td>
- <td>99.90%
</td>
</tr>
<tr>
<td>MMLU-cot (0-shot)
</td>
- <td>86.01
</td>
- <td>85.48
</td>
- <td>99.38%
</td>
</tr>
<tr>
<td>ARC Challenge (0-shot)
</td>
- <td>93.26
</td>
- <td>93.52
</td>
<td>100.2%
</td>
@@ -189,51 +213,149 @@ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GS
<tr>
<td>GSM-8K-cot (8-shot, strict-match)
</td>
- <td>94.92
</td>
- <td>94.54
</td>
- <td>99.60%
</td>
</tr>
<tr>
<td>Hellaswag (10-shot)
</td>
- <td>86.75
</td>
- <td>86.63
</td>
- <td>99.86%
</td>
</tr>
<tr>
<td>Winogrande (5-shot)
</td>
- <td>85.32
</td>
- <td>84.61
</td>
- <td>99.17%
</td>
</tr>
<tr>
<td>TruthfulQA (0-shot, mc2)
</td>
- <td>60.68
</td>
- <td>60.60
</td>
- <td>99.87%
</td>
</tr>
<tr>
<td><strong>Average</strong>
</td>
- <td><strong>84.40</strong>
</td>
- <td><strong>84.16</strong>
</td>
- <td><strong>99.72%</strong>
</td>
</tr>
</table>
@@ -314,4 +436,39 @@ lm_eval \
--tasks truthfulqa \
--num_fewshot 0 \
--batch_size auto
```
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

+ This model is a quantized version of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
+ It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice question answering, math reasoning, and open-ended text generation.
+ Meta-Llama-3.1-70B-Instruct-FP8-dynamic achieves 101.6% recovery for the Arena-Hard evaluation, 99.7% for OpenLLM v1 (using Meta's prompting when available), 100.0% for OpenLLM v2, 100.4% for HumanEval pass@1, and 100.3% for HumanEval+ pass@1.

### Model Optimizations

## Evaluation

+ This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks.
+ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+
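For readers who want to reproduce a single generation outside of the evaluation harnesses, the snippet below is a minimal vLLM offline-inference sketch; the prompt and sampling settings are placeholders rather than the exact configuration used for the reported benchmarks.

```
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic"

# Load the model with vLLM; tensor_parallel_size and max_model_len mirror the
# evaluation commands further below, but any compatible setting works.
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=2, max_model_len=4096)

# Format a single-turn chat prompt with the model's own chat template.
messages = [{"role": "user", "content": "Explain FP8 weight quantization in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=256))
print(outputs[0].outputs[0].text)
```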
+ Arena-Hard evaluations were conducted using the [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) repository.
+ The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4.
+ We report below the scores obtained in each judgement and their average.
+
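As a sanity check on the Arena-Hard numbers reported in the table below, this short sketch recomputes each score as the mean of its two GPT-4 judgements and derives the recovery of the quantized model relative to the baseline.

```
# Arena-Hard arithmetic: each reported score is the mean of two GPT-4 judgements,
# and recovery is the ratio of the quantized score to the unquantized baseline.
baseline_judgements = (55.8, 58.2)   # Meta-Llama-3.1-70B-Instruct
quantized_judgements = (58.1, 57.7)  # Meta-Llama-3.1-70B-Instruct-FP8-dynamic

baseline = sum(baseline_judgements) / 2    # 57.0
quantized = sum(quantized_judgements) / 2  # 57.9
recovery = 100 * quantized / baseline      # ~101.6%

print(f"baseline={baseline:.1f}, quantized={quantized:.1f}, recovery={recovery:.1f}%")
```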
+ OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct).
+ This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-70B-Instruct-evals) and a few fixes to OpenLLM v2 tasks.
+
+ HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.
+
+ Detailed model outputs are available as HuggingFace datasets for [Arena-Hard](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-arena-hard-evals), [OpenLLM v2](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-leaderboard-v2-evals), and [HumanEval](https://huggingface.co/datasets/neuralmagic/quantized-llama-3.1-humaneval-evals).
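The generations behind these numbers can be inspected directly; the snippet below is a hedged sketch of downloading one of the published datasets with the `datasets` library, since the exact configuration, split, and column names are not documented here.

```
# Hedged sketch: pull the published Arena-Hard generations for inspection.
# The split and column names are assumptions; check the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("neuralmagic/quantized-llama-3.1-arena-hard-evals")
print(ds)                    # list the available splits
split = next(iter(ds))       # take the first split, whatever it is named
print(ds[split][0])          # peek at a single record
```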
 
### Accuracy

<table>
<tr>
<td><strong>Benchmark</strong>
</td>
<td><strong>Meta-Llama-3.1-70B-Instruct </strong>
</td>
+ <td><strong>Meta-Llama-3.1-70B-Instruct-FP8-dynamic (this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>
</tr>
+ <tr>
+ <td><strong>Arena Hard</strong>
+ </td>
+ <td>57.0 (55.8 / 58.2)
+ </td>
+ <td>57.9 (58.1 / 57.7)
+ </td>
+ <td>101.6%
+ </td>
+ </tr>
+ <tr>
+ <td><strong>OpenLLM v1</strong>
+ </td>
+ </tr>
<tr>
<td>MMLU (5-shot)
</td>
+ <td>83.8
</td>
+ <td>83.8
</td>
+ <td>99.9%
</td>
</tr>
<tr>
<td>MMLU-cot (0-shot)
</td>
+ <td>86.0
</td>
+ <td>85.5
</td>
+ <td>99.4%
</td>
</tr>
<tr>
<td>ARC Challenge (0-shot)
</td>
+ <td>93.3
</td>
+ <td>93.5
</td>
<td>100.2%
</td>

<tr>
<td>GSM-8K-cot (8-shot, strict-match)
</td>
+ <td>94.9
</td>
+ <td>94.5
</td>
+ <td>99.6%
</td>
</tr>
<tr>
<td>Hellaswag (10-shot)
</td>
+ <td>86.8
</td>
+ <td>86.6
</td>
+ <td>99.9%
</td>
</tr>
<tr>
<td>Winogrande (5-shot)
</td>
+ <td>85.3
</td>
+ <td>84.6
</td>
+ <td>99.2%
</td>
</tr>
<tr>
<td>TruthfulQA (0-shot, mc2)
</td>
+ <td>60.7
</td>
+ <td>60.6
</td>
+ <td>99.9%
</td>
</tr>
<tr>
<td><strong>Average</strong>
</td>
+ <td><strong>84.4</strong>
+ </td>
+ <td><strong>84.2</strong>
</td>
+ <td><strong>99.7%</strong>
</td>
+ </tr>
+ <tr>
+ <td><strong>OpenLLM v2</strong>
+ </td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro (5-shot)
+ </td>
+ <td>48.1
+ </td>
+ <td>47.7
+ </td>
+ <td>99.1%
+ </td>
+ </tr>
+ <tr>
+ <td>IFEval (0-shot)
+ </td>
+ <td>86.4
+ </td>
+ <td>87.6
+ </td>
+ <td>101.3%
+ </td>
+ </tr>
+ <tr>
+ <td>BBH (3-shot)
+ </td>
+ <td>55.8
+ </td>
+ <td>54.9
+ </td>
+ <td>98.4%
+ </td>
+ </tr>
+ <tr>
+ <td>Math-lvl-5 (4-shot)
+ </td>
+ <td>26.1
+ </td>
+ <td>28.0
+ </td>
+ <td>107.5%
+ </td>
+ </tr>
+ <tr>
+ <td>GPQA (0-shot)
+ </td>
+ <td>15.4
+ </td>
+ <td>14.6
+ </td>
+ <td>94.7%
+ </td>
+ </tr>
+ <tr>
+ <td>MuSR (0-shot)
+ </td>
+ <td>18.2
+ </td>
+ <td>17.2
+ </td>
+ <td>94.5%
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Average</strong>
+ </td>
+ <td><strong>41.7</strong>
+ </td>
+ <td><strong>41.7</strong>
+ </td>
+ <td><strong>100.0%</strong>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Coding</strong>
+ </td>
+ </tr>
+ <tr>
+ <td>HumanEval pass@1
+ </td>
+ <td>79.7
+ </td>
+ <td>80.0
+ </td>
+ <td>100.4%
+ </td>
+ </tr>
+ <tr>
+ <td>HumanEval+ pass@1
+ </td>
+ <td>74.8
+ </td>
+ <td>75.0
+ </td>
+ <td>100.3%
</td>
</tr>
</table>
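The Recovery and Average columns can be reproduced directly from the per-benchmark scores; the sketch below does so for the OpenLLM v1 block. Because the table shows rounded scores while the published recoveries were computed from unrounded ones, the last digit may differ slightly (e.g., MMLU).

```
# Recompute the OpenLLM v1 "Recovery" and "Average" columns from the rounded
# (baseline, quantized) scores in the table above.
openllm_v1 = {
    "MMLU (5-shot)":                (83.8, 83.8),
    "MMLU-cot (0-shot)":            (86.0, 85.5),
    "ARC Challenge (0-shot)":       (93.3, 93.5),
    "GSM-8K-cot (8-shot)":          (94.9, 94.5),
    "Hellaswag (10-shot)":          (86.8, 86.6),
    "Winogrande (5-shot)":          (85.3, 84.6),
    "TruthfulQA (0-shot, mc2)":     (60.7, 60.6),
}

for name, (base, quant) in openllm_v1.items():
    print(f"{name}: recovery {100 * quant / base:.1f}%")

base_avg = sum(b for b, _ in openllm_v1.values()) / len(openllm_v1)
quant_avg = sum(q for _, q in openllm_v1.values()) / len(openllm_v1)
print(f"Average: {base_avg:.1f} vs {quant_avg:.1f} -> {100 * quant_avg / base_avg:.1f}% recovery")
```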
 
--tasks truthfulqa \
--num_fewshot 0 \
--batch_size auto
+ ```
+
+ #### OpenLLM v2
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True \
+ --apply_chat_template \
+ --fewshot_as_multiturn \
+ --tasks leaderboard \
+ --batch_size auto
+ ```
+
+ #### HumanEval and HumanEval+
+ ##### Generation
+ ```
+ python3 codegen/generate.py \
+ --model neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8-dynamic \
+ --bs 16 \
+ --temperature 0.2 \
+ --n_samples 50 \
+ --root "." \
+ --dataset humaneval \
+ --tp 2
+ ```
+ ##### Sanitization
+ ```
+ python3 evalplus/sanitize.py \
+ humaneval/neuralmagic--Meta-Llama-3.1-70B-Instruct-FP8-dynamic_vllm_temp_0.2
+ ```
+ ##### Evaluation
+ ```
+ evalplus.evaluate \
+ --dataset humaneval \
+ --samples humaneval/neuralmagic--Meta-Llama-3.1-70B-Instruct-FP8-dynamic_vllm_temp_0.2-sanitized
```
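Since generation above produces 50 samples per problem at temperature 0.2, the reported pass@1 is presumably aggregated with the standard unbiased pass@k estimator rather than taken from a single completion; the sketch below shows that estimator for reference (EvalPlus performs this aggregation itself).

```
# Unbiased pass@k estimator from the HumanEval paper, shown only as a reference
# for how pass@1 can be aggregated over the 50 samples generated per task above.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that pass the tests, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: if 40 of 50 samples pass, the estimated pass@1 is 0.80.
print(pass_at_k(n=50, c=40, k=1))
```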