Commit
•
bdf5de1
1
Parent(s):
cd6881a
Update README.md. Added Korean benchmarks. (#18)
Browse files- Update README.md. Added Korean benchmarks. (7b38101d099f2dec07d288571e3a2c10f230e519)
Co-authored-by: Daekeun Kim <[email protected]>
README.md
CHANGED
@@ -373,4 +373,100 @@ This project may contain trademarks or logos for projects, products, or services
|
|
373 |
|-----------|-----------------------|---------------------------------------|--------------------------|---------------------------|------------------|----------------|------------------|-------------------------------|
|
374 |
| English | 94.6 | 94.6 | 85.6 | 94.4 | 37.6 | 63.8 | 92.0 | 98.2 |
|
375 |
| Italian | 86.8 | 84.8 | 76.8 | 83.2 | 16.2 | 37.2 | 85.6 | 97.6 |
|
376 |
-
| Turkish | 58.6 | 57.2 | 61.6 | 56.6 | 38.4 | 60.2 | 91.4 | 94.6 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
373 |
|-----------|-----------------------|---------------------------------------|--------------------------|---------------------------|------------------|----------------|------------------|-------------------------------|
|
374 |
| English | 94.6 | 94.6 | 85.6 | 94.4 | 37.6 | 63.8 | 92.0 | 98.2 |
|
375 |
| Italian | 86.8 | 84.8 | 76.8 | 83.2 | 16.2 | 37.2 | 85.6 | 97.6 |
|
376 |
+
| Turkish | 58.6 | 57.2 | 61.6 | 56.6 | 38.4 | 60.2 | 91.4 | 94.6 |
|
377 |
+
|
378 |
+
|
379 |
+
## Appendix B: Korean benchmarks
|
380 |
+
|
381 |
+
The prompt is the same as the [CLIcK paper](https://arxiv.org/abs/2403.06412) prompt. The experimental results below were given with max_tokens=512 (zero-shot), max_tokens=1024 (5-shot), temperature=0.01. No system prompt used.
|
382 |
+
|
383 |
+
- GPT-4o: 2024-05-13 version
|
384 |
+
- GPT-4o-mini: 2024-07-18 version
|
385 |
+
- GPT-4-turbo: 2024-04-09 version
|
386 |
+
- GPT-3.5-turbo: 2023-06-13 version
|
387 |
+
|
388 |
+
| Benchmarks | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
389 |
+
|:-------------------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
390 |
+
| CLIcK | 42.99 | 29.12 | 47.82 | 80.46 | 68.5 | 72.82 | 50.98 |
|
391 |
+
| HAERAE 1.0 | 44.21 | 36.41 | 53.9 | 85.7 | 76.4 | 77.76 | 52.67 |
|
392 |
+
| KMMLU (0-shot, CoT) | 35.87 | 30.82 | 38.54 | 64.26 | 52.63 | 58.75 | 40.3 |
|
393 |
+
| KMMLU (5-shot) | 37.35 | 29.98 | 20.21 | 64.28 | 51.62 | 59.29 | 42.28 |
|
394 |
+
| KMMLU-HARD (0-shot, CoT) | 24 | 25.68 | 24.03 | 39.62 | 24.56 | 30.56 | 20.97 |
|
395 |
+
| KMMLU-HARD (5-shot) | 24.76 | 25.73 | 15.81 | 40.94 | 24.63 | 31.12 | 21.19 |
|
396 |
+
| Average | 35.62 | 29.99 | 29.29 | 62.54 | 50.08 | 56.74 | 39.61 |
|
397 |
+
|
398 |
+
#### CLIcK (Cultural and Linguistic Intelligence in Korean)
|
399 |
+
|
400 |
+
##### Accuracy by supercategory
|
401 |
+
| supercategory | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
402 |
+
|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
403 |
+
| Culture | 43.77 | 29.74 | 51.15 | 81.89 | 70.95 | 73.61 | 53.38 |
|
404 |
+
| Language | 41.38 | 27.85 | 40.92 | 77.54 | 63.54 | 71.23 | 46 |
|
405 |
+
| **Overall** | 42.99 | 29.12 | 47.82 | 80.46 | 68.5 | 72.82 | 50.98 |
|
406 |
+
|
407 |
+
##### Accuracy by category
|
408 |
+
| supercategory | category | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
409 |
+
|:----------------|:------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
410 |
+
| Culture | Economy | 61.02 | 28.81 | 66.1 | 94.92 | 83.05 | 89.83 | 64.41 |
|
411 |
+
| Culture | Geography | 45.8 | 29.01 | 54.2 | 80.15 | 77.86 | 82.44 | 53.44 |
|
412 |
+
| Culture | History | 26.15 | 30 | 29.64 | 66.92 | 48.4 | 46.4 | 31.79 |
|
413 |
+
| Culture | Law | 32.42 | 22.83 | 44.29 | 70.78 | 57.53 | 61.19 | 41.55 |
|
414 |
+
| Culture | Politics | 54.76 | 33.33 | 59.52 | 88.1 | 83.33 | 89.29 | 65.48 |
|
415 |
+
| Culture | Pop Culture | 60.98 | 34.15 | 60.98 | 97.56 | 85.37 | 92.68 | 75.61 |
|
416 |
+
| Culture | Society | 54.37 | 31.72 | 65.05 | 92.88 | 85.44 | 86.73 | 71.2 |
|
417 |
+
| Culture | Tradition | 47.75 | 31.98 | 54.95 | 87.39 | 74.77 | 79.28 | 55.86 |
|
418 |
+
| Language | Functional | 37.6 | 24 | 32.8 | 84.8 | 64.8 | 80 | 40 |
|
419 |
+
| Language | Grammar | 27.5 | 23.33 | 22.92 | 57.08 | 42.5 | 47.5 | 30 |
|
420 |
+
| Language | Textual | 54.74 | 33.33 | 59.65 | 91.58 | 80.7 | 87.37 | 62.11 |
|
421 |
+
|
422 |
+
#### HAERAE
|
423 |
+
|
424 |
+
| category | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
425 |
+
|:----------------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
426 |
+
| General Knowledge | 31.25 | 28.41 | 34.66 | 77.27 | 53.41 | 66.48 | 40.91 |
|
427 |
+
| History | 32.45 | 22.34 | 44.15 | 92.02 | 84.57 | 78.72 | 30.32 |
|
428 |
+
| Loan Words | 47.93 | 35.5 | 63.31 | 79.88 | 76.33 | 78.11 | 59.17 |
|
429 |
+
| Rare Words | 55.06 | 42.96 | 63.21 | 87.9 | 81.98 | 79.01 | 61.23 |
|
430 |
+
| Reading Comprehension | 42.95 | 41.16 | 51.9 | 85.46 | 77.18 | 80.09 | 56.15 |
|
431 |
+
| Standard Nomenclature | 44.44 | 32.68 | 58.82 | 88.89 | 75.82 | 79.08 | 53.59 |
|
432 |
+
| **Overall** | 44.21 | 36.41 | 53.9 | 85.7 | 76.4 | 77.76 | 52.67 |
|
433 |
+
|
434 |
+
#### KMMLU (0-shot, CoT)
|
435 |
+
|
436 |
+
| supercategory | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
437 |
+
|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
438 |
+
| Applied Science | 35.8 | 31.68 | 37.03 | 61.52 | 49.29 | 55.98 | 38.47 |
|
439 |
+
| HUMSS | 31.56 | 26.47 | 37.29 | 69.45 | 56.59 | 63 | 40.9 |
|
440 |
+
| Other | 35.45 | 31.01 | 39.15 | 63.79 | 52.35 | 57.53 | 40.19 |
|
441 |
+
| STEM | 38.54 | 31.9 | 40.42 | 65.16 | 54.74 | 60.84 | 42.24 |
|
442 |
+
| **Overall** | 35.87 | 30.82 | 38.54 | 64.26 | 52.63 | 58.75 | 40.3 |
|
443 |
+
|
444 |
+
#### KMMLU (5-shot)
|
445 |
+
|
446 |
+
| supercategory | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
447 |
+
|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
448 |
+
| Applied Science | 37.42 | 29.98 | 19.24 | 61.47 | 48.66 | 56.85 | 40.22 |
|
449 |
+
| HUMSS | 34.72 | 27.27 | 22.5 | 68.79 | 55.95 | 63.68 | 43.35 |
|
450 |
+
| Other | 37.04 | 30.76 | 20.95 | 64.21 | 51.1 | 57.85 | 41.92 |
|
451 |
+
| STEM | 38.9 | 30.73 | 19.55 | 65.28 | 53.29 | 61.08 | 44.43 |
|
452 |
+
| **Overall** | 37.35 | 29.98 | 20.21 | 64.28 | 51.62 | 59.29 | 42.28 |
|
453 |
+
|
454 |
+
#### KMMLU-HARD (0-shot, CoT)
|
455 |
+
|
456 |
+
| supercategory | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
457 |
+
|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
458 |
+
| Applied Science | 27.08 | 26.17 | 26.25 | 37.12 | 22.25 | 29.17 | 21.07 |
|
459 |
+
| HUMSS | 20.21 | 24.38 | 20.21 | 41.97 | 23.31 | 31.51 | 19.44 |
|
460 |
+
| Other | 23.05 | 24.82 | 23.88 | 40.39 | 26.48 | 29.59 | 22.22 |
|
461 |
+
| STEM | 24.36 | 26.91 | 24.64 | 39.82 | 26.36 | 32.18 | 20.91 |
|
462 |
+
| **Overall** | 24 | 25.68 | 24.03 | 39.62 | 24.56 | 30.56 | 20.97 |
|
463 |
+
|
464 |
+
#### KMMLU-HARD (5-shot)
|
465 |
+
|
466 |
+
| supercategory | Phi-3.5-Mini-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo |
|
467 |
+
|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
|
468 |
+
| Applied Science | 25 | 29 | 12 | 31 | 21 | 25 | 20 |
|
469 |
+
| HUMSS | 21.89 | 19.92 | 14 | 43.98 | 23.47 | 33.53 | 19.53 |
|
470 |
+
| Other | 23.26 | 27.27 | 12.83 | 39.84 | 28.34 | 29.68 | 23.22 |
|
471 |
+
| STEM | 20.5 | 25.25 | 12.75 | 40.25 | 23.25 | 27.25 | 19.75 |
|
472 |
+
| **Overall** | 24.76 | 25.73 | 15.81 | 40.94 | 24.63 | 31.12 | 21.19 |
|