Update README.md. Added Korean benchmarks.

#38
Files changed (1) hide show
  1. README.md +97 -1
README.md CHANGED
@@ -276,4 +276,100 @@ Note that by default, the Phi-3.5-MoE-instruct model uses flash attention, which
276
  The model is licensed under the [MIT license](./LICENSE).
277
 
278
  ## Trademarks
279
- This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
276
  The model is licensed under the [MIT license](./LICENSE).
277
 
278
  ## Trademarks
279
+ This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
280
+
281
+
282
+ ## Appendix A: Korean benchmarks
283
+
284
+ The prompt is the same as the [CLIcK paper](https://arxiv.org/abs/2403.06412) prompt. The experimental results below were given with max_tokens=512 (zero-shot), max_tokens=1024 (5-shot), temperature=0.01. No system prompt used.
285
+
286
+ - GPT-4o: 2024-05-13 version
287
+ - GPT-4o-mini: 2024-07-18 version
288
+ - GPT-4-turbo: 2024-04-09 version
289
+ - GPT-3.5-turbo: 2023-06-13 version
290
+
291
+ | Benchmarks | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
292
+ |:-------------------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
293
+ | CLIcK | 56.44 | 29.12 | 47.82 | 80.46 | 68.5 | 72.82 | 50.98 |
294
+ | HAERAE 1.0 | 61.83 | 36.41 | 53.9 | 85.7 | 76.4 | 77.76 | 52.67 |
295
+ | KMMLU (0-shot, CoT) | 47.43 | 30.82 | 38.54 | 64.26 | 52.63 | 58.75 | 40.3 |
296
+ | KMMLU (5-shot) | 47.92 | 29.98 | 20.21 | 64.28 | 51.62 | 59.29 | 42.28 |
297
+ | KMMLU-HARD (0-shot, CoT) | 25.34 | 25.68 | 24.03 | 39.62 | 24.56 | 30.56 | 20.97 |
298
+ | KMMLU-HARD (5-shot) | 25.66 | 25.73 | 15.81 | 40.94 | 24.63 | 31.12 | 21.19 |
299
+ | Average | 45.82 | 29.99 | 29.29 | 62.54 | 50.08 | 56.74 | 39.61 |
300
+
301
+ #### CLIcK (Cultural and Linguistic Intelligence in Korean)
302
+
303
+ ##### Accuracy by supercategory
304
+ | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
305
+ |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
306
+ | Culture | 58.44 | 29.74 | 51.15 | 81.89 | 70.95 | 73.61 | 53.38 |
307
+ | Language | 52.31 | 27.85 | 40.92 | 77.54 | 63.54 | 71.23 | 46 |
308
+ | **Overall** | 56.44 | 29.12 | 47.82 | 80.46 | 68.5 | 72.82 | 50.98 |
309
+
310
+ ##### Accuracy by category
311
+ | supercategory | category | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
312
+ |:----------------|:------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
313
+ | Culture | Economy | 77.97 | 28.81 | 66.1 | 94.92 | 83.05 | 89.83 | 64.41 |
314
+ | Culture | Geography | 60.31 | 29.01 | 54.2 | 80.15 | 77.86 | 82.44 | 53.44 |
315
+ | Culture | History | 33.93 | 30 | 29.64 | 66.92 | 48.4 | 46.4 | 31.79 |
316
+ | Culture | Law | 52.51 | 22.83 | 44.29 | 70.78 | 57.53 | 61.19 | 41.55 |
317
+ | Culture | Politics | 70.24 | 33.33 | 59.52 | 88.1 | 83.33 | 89.29 | 65.48 |
318
+ | Culture | Pop Culture | 80.49 | 34.15 | 60.98 | 97.56 | 85.37 | 92.68 | 75.61 |
319
+ | Culture | Society | 74.43 | 31.72 | 65.05 | 92.88 | 85.44 | 86.73 | 71.2 |
320
+ | Culture | Tradition | 58.11 | 31.98 | 54.95 | 87.39 | 74.77 | 79.28 | 55.86 |
321
+ | Language | Functional | 48 | 24 | 32.8 | 84.8 | 64.8 | 80 | 40 |
322
+ | Language | Grammar | 29.58 | 23.33 | 22.92 | 57.08 | 42.5 | 47.5 | 30 |
323
+ | Language | Textual | 73.33 | 33.33 | 59.65 | 91.58 | 80.7 | 87.37 | 62.11 |
324
+
325
+ #### HAERAE 1.0
326
+
327
+ | category | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
328
+ |:----------------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
329
+ | General Knowledge | 39.77 | 28.41 | 34.66 | 77.27 | 53.41 | 66.48 | 40.91 |
330
+ | History | 60.64 | 22.34 | 44.15 | 92.02 | 84.57 | 78.72 | 30.32 |
331
+ | Loan Words | 70.41 | 35.5 | 63.31 | 79.88 | 76.33 | 78.11 | 59.17 |
332
+ | Rare Words | 63.95 | 42.96 | 63.21 | 87.9 | 81.98 | 79.01 | 61.23 |
333
+ | Reading Comprehension | 64.43 | 41.16 | 51.9 | 85.46 | 77.18 | 80.09 | 56.15 |
334
+ | Standard Nomenclature | 66.01 | 32.68 | 58.82 | 88.89 | 75.82 | 79.08 | 53.59 |
335
+ | **Overall** | 61.83 | 36.41 | 53.9 | 85.7 | 76.4 | 77.76 | 52.67 |
336
+
337
+ #### KMMLU (0-shot, CoT)
338
+
339
+ | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
340
+ |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
341
+ | Applied Science | 45.15 | 31.68 | 37.03 | 61.52 | 49.29 | 55.98 | 38.47 |
342
+ | HUMSS | 49.75 | 26.47 | 37.29 | 69.45 | 56.59 | 63 | 40.9 |
343
+ | Other | 47.24 | 31.01 | 39.15 | 63.79 | 52.35 | 57.53 | 40.19 |
344
+ | STEM | 49.08 | 31.9 | 40.42 | 65.16 | 54.74 | 60.84 | 42.24 |
345
+ | **Overall** | 47.43 | 30.82 | 38.54 | 64.26 | 52.63 | 58.75 | 40.3 |
346
+
347
+ #### KMMLU (5-shot)
348
+
349
+ | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
350
+ |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
351
+ | Applied Science | 45.9 | 29.98 | 19.24 | 61.47 | 48.66 | 56.85 | 40.22 |
352
+ | HUMSS | 49.18 | 27.27 | 22.5 | 68.79 | 55.95 | 63.68 | 43.35 |
353
+ | Other | 48.43 | 30.76 | 20.95 | 64.21 | 51.1 | 57.85 | 41.92 |
354
+ | STEM | 49.21 | 30.73 | 19.55 | 65.28 | 53.29 | 61.08 | 44.43 |
355
+ | **Overall** | 47.92 | 29.98 | 20.21 | 64.28 | 51.62 | 59.29 | 42.28 |
356
+
357
+ #### KMMLU-HARD (0-shot, CoT)
358
+
359
+ | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024)| Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
360
+ |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
361
+ | Applied Science | 25.83 | 26.17 | 26.25 | 37.12 | 22.25 | 29.17 | 21.07 |
362
+ | HUMSS | 21.52 | 24.38 | 20.21 | 41.97 | 23.31 | 31.51 | 19.44 |
363
+ | Other | 24.82 | 24.82 | 23.88 | 40.39 | 26.48 | 29.59 | 22.22 |
364
+ | STEM | 28.18 | 26.91 | 24.64 | 39.82 | 26.36 | 32.18 | 20.91 |
365
+ | **Overall** | 25.34 | 25.68 | 24.03 | 39.62 | 24.56 | 30.56 | 20.97 |
366
+
367
+ #### KMMLU-HARD (5-shot)
368
+
369
+ | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
370
+ |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
371
+ | Applied Science | 21 | 29 | 12 | 31 | 21 | 25 | 20 |
372
+ | HUMSS | 22.88 | 19.92 | 14 | 43.98 | 23.47 | 33.53 | 19.53 |
373
+ | Other | 25.13 | 27.27 | 12.83 | 39.84 | 28.34 | 29.68 | 23.22 |
374
+ | STEM | 21.75 | 25.25 | 12.75 | 40.25 | 23.25 | 27.25 | 19.75 |
375
+ | **Overall** | 25.66 | 25.73 | 15.81 | 40.94 | 24.63 | 31.12 | 21.19 |