microsoft
/

Phi-3.5-mini-instruct

Text Generation

Transformers

Safetensors

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

nguyenbh

daekeun-ml commited on Aug 29

Commit

bdf5de1

•

1 Parent(s): cd6881a

Update README.md. Added Korean benchmarks. (#18)

Browse files

- Update README.md. Added Korean benchmarks. (7b38101d099f2dec07d288571e3a2c10f230e519)

Co-authored-by: Daekeun Kim <[email protected]>

Files changed (1) hide show

README.md +97 -1

README.md CHANGED Viewed

@@ -373,4 +373,100 @@ This project may contain trademarks or logos for projects, products, or services
 |-----------|-----------------------|---------------------------------------|--------------------------|---------------------------|------------------|----------------|------------------|-------------------------------|
 | English   | 94.6                  | 94.6                                  | 85.6                     | 94.4                      | 37.6             | 63.8           | 92.0             | 98.2                          |
 | Italian   | 86.8                  | 84.8                                  | 76.8                     | 83.2                      | 16.2             | 37.2           | 85.6             | 97.6                          |
-| Turkish   | 58.6                  | 57.2                                  | 61.6                     | 56.6                      | 38.4             | 60.2           | 91.4             | 94.6                          |

 |-----------|-----------------------|---------------------------------------|--------------------------|---------------------------|------------------|----------------|------------------|-------------------------------|
 | English   | 94.6                  | 94.6                                  | 85.6                     | 94.4                      | 37.6             | 63.8           | 92.0             | 98.2                          |
 | Italian   | 86.8                  | 84.8                                  | 76.8                     | 83.2                      | 16.2             | 37.2           | 85.6             | 97.6                          |
+| Turkish   | 58.6                  | 57.2                                  | 61.6                     | 56.6                      | 38.4             | 60.2           | 91.4             | 94.6                          |
+## Appendix B: Korean benchmarks
+The prompt is the same as the [CLIcK paper](https://arxiv.org/abs/2403.06412) prompt. The experimental results below were given with max_tokens=512 (zero-shot), max_tokens=1024 (5-shot), temperature=0.01. No system prompt used.
+- GPT-4o: 2024-05-13 version
+- GPT-4o-mini: 2024-07-18 version
+- GPT-4-turbo: 2024-04-09 version
+- GPT-3.5-turbo: 2023-06-13 version
+| Benchmarks               |   Phi-3.5-Mini-Instruct |  Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:-------------------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| CLIcK                    |                   42.99 |                           29.12 |                   47.82 |    80.46 |         68.5  |         72.82 |           50.98 |
+| HAERAE 1.0               |                   44.21 |                           36.41 |                   53.9  |    85.7  |         76.4  |         77.76 |           52.67 |
+| KMMLU (0-shot, CoT)      |                   35.87 |                           30.82 |                   38.54 |    64.26 |         52.63 |         58.75 |           40.3  |
+| KMMLU (5-shot)           |                   37.35 |                           29.98 |                   20.21 |    64.28 |         51.62 |         59.29 |           42.28 |
+| KMMLU-HARD (0-shot, CoT) |                   24    |                           25.68 |                   24.03 |    39.62 |         24.56 |         30.56 |           20.97 |
+| KMMLU-HARD (5-shot)      |                   24.76 |                           25.73 |                   15.81 |    40.94 |         24.63 |         31.12 |           21.19 |
+| Average                  |                   35.62 |                           29.99 |                   29.29 |    62.54 |         50.08 |         56.74 |           39.61 |
+#### CLIcK (Cultural and Linguistic Intelligence in Korean)
+##### Accuracy by supercategory
+| supercategory   |   Phi-3.5-Mini-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Culture         |                   43.77 |                           29.74 |                   51.15 |    81.89 |         70.95 |         73.61 |           53.38 |
+| Language        |                   41.38 |                           27.85 |                   40.92 |    77.54 |         63.54 |         71.23 |           46    |
+| **Overall**     |                   42.99 |                           29.12 |                   47.82 |    80.46 |         68.5  |         72.82 |           50.98 |
+##### Accuracy by category
+| supercategory   | category    |   Phi-3.5-Mini-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|:------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Culture         | Economy     |                   61.02 |                           28.81 |                   66.1  |    94.92 |         83.05 |         89.83 |           64.41 |
+| Culture         | Geography   |                   45.8  |                           29.01 |                   54.2  |    80.15 |         77.86 |         82.44 |           53.44 |
+| Culture         | History     |                   26.15 |                           30    |                   29.64 |    66.92 |         48.4  |         46.4  |           31.79 |
+| Culture         | Law         |                   32.42 |                           22.83 |                   44.29 |    70.78 |         57.53 |         61.19 |           41.55 |
+| Culture         | Politics    |                   54.76 |                           33.33 |                   59.52 |    88.1  |         83.33 |         89.29 |           65.48 |
+| Culture         | Pop Culture |                   60.98 |                           34.15 |                   60.98 |    97.56 |         85.37 |         92.68 |           75.61 |
+| Culture         | Society     |                   54.37 |                           31.72 |                   65.05 |    92.88 |         85.44 |         86.73 |           71.2  |
+| Culture         | Tradition   |                   47.75 |                           31.98 |                   54.95 |    87.39 |         74.77 |         79.28 |           55.86 |
+| Language        | Functional  |                   37.6  |                           24    |                   32.8  |    84.8  |         64.8  |         80    |           40    |
+| Language        | Grammar     |                   27.5  |                           23.33 |                   22.92 |    57.08 |         42.5  |         47.5  |           30    |
+| Language        | Textual     |                   54.74 |                           33.33 |                   59.65 |    91.58 |         80.7  |         87.37 |           62.11 |
+#### HAERAE
+| category              |   Phi-3.5-Mini-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| General Knowledge     |                   31.25 |                           28.41 |                   34.66 |    77.27 |         53.41 |         66.48 |           40.91 |
+| History               |                   32.45 |                           22.34 |                   44.15 |    92.02 |         84.57 |         78.72 |           30.32 |
+| Loan Words            |                   47.93 |                           35.5  |                   63.31 |    79.88 |         76.33 |         78.11 |           59.17 |
+| Rare Words            |                   55.06 |                           42.96 |                   63.21 |    87.9  |         81.98 |         79.01 |           61.23 |
+| Reading Comprehension |                   42.95 |                           41.16 |                   51.9  |    85.46 |         77.18 |         80.09 |           56.15 |
+| Standard Nomenclature |                   44.44 |                           32.68 |                   58.82 |    88.89 |         75.82 |         79.08 |           53.59 |
+| **Overall**           |                   44.21 |                           36.41 |                   53.9  |    85.7  |         76.4  |         77.76 |           52.67 |
+#### KMMLU (0-shot, CoT)
+| supercategory   |   Phi-3.5-Mini-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                   35.8  |                           31.68 |                   37.03 |    61.52 |         49.29 |         55.98 |           38.47 |
+| HUMSS           |                   31.56 |                           26.47 |                   37.29 |    69.45 |         56.59 |         63    |           40.9  |
+| Other           |                   35.45 |                           31.01 |                   39.15 |    63.79 |         52.35 |         57.53 |           40.19 |
+| STEM            |                   38.54 |                           31.9  |                   40.42 |    65.16 |         54.74 |         60.84 |           42.24 |
+| **Overall**     |                   35.87 |                           30.82 |                   38.54 |    64.26 |         52.63 |         58.75 |           40.3  |
+#### KMMLU (5-shot)
+| supercategory   |   Phi-3.5-Mini-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                   37.42 |                           29.98 |                   19.24 |    61.47 |         48.66 |         56.85 |           40.22 |
+| HUMSS           |                   34.72 |                           27.27 |                   22.5  |    68.79 |         55.95 |         63.68 |           43.35 |
+| Other           |                   37.04 |                           30.76 |                   20.95 |    64.21 |         51.1  |         57.85 |           41.92 |
+| STEM            |                   38.9  |                           30.73 |                   19.55 |    65.28 |         53.29 |         61.08 |           44.43 |
+| **Overall**     |                   37.35 |                           29.98 |                   20.21 |    64.28 |         51.62 |         59.29 |           42.28 |
+#### KMMLU-HARD (0-shot, CoT)
+| supercategory   |   Phi-3.5-Mini-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                   27.08 |                           26.17 |                   26.25 |    37.12 |         22.25 |         29.17 |           21.07 |
+| HUMSS           |                   20.21 |                           24.38 |                   20.21 |    41.97 |         23.31 |         31.51 |           19.44 |
+| Other           |                   23.05 |                           24.82 |                   23.88 |    40.39 |         26.48 |         29.59 |           22.22 |
+| STEM            |                   24.36 |                           26.91 |                   24.64 |    39.82 |         26.36 |         32.18 |           20.91 |
+| **Overall**     |                   24    |                           25.68 |                   24.03 |    39.62 |         24.56 |         30.56 |           20.97 |
+#### KMMLU-HARD (5-shot)
+| supercategory   |   Phi-3.5-Mini-Instruct |   Phi-3.0-Mini-128k-Instruct (June2024) |   Llama-3.1-8B-Instruct |   GPT-4o |   GPT-4o-mini |   GPT-4-turbo |   GPT-3.5-turbo |
+|:----------------|------------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:|
+| Applied Science |                   25    |                           29    |                   12    |    31    |         21    |         25    |           20    |
+| HUMSS           |                   21.89 |                           19.92 |                   14    |    43.98 |         23.47 |         33.53 |           19.53 |
+| Other           |                   23.26 |                           27.27 |                   12.83 |    39.84 |         28.34 |         29.68 |           23.22 |
+| STEM            |                   20.5  |                           25.25 |                   12.75 |    40.25 |         23.25 |         27.25 |           19.75 |
+| **Overall**     |                   24.76 |                           25.73 |                   15.81 |    40.94 |         24.63 |         31.12 |           21.19 |