dfurman/Llama-2-13B-Instruct-v0.2


dfurman committed on Nov 20, 2023

Commit 9109d1a • 1 Parent(s): 4fbdb08

Update README.md

Files changed (1)

README.md +13 -21


@@ -36,17 +36,21 @@ This model was built via parameter-efficient finetuning of the [meta-llama/Llama
 - **Repository:** [here](https://github.com/daniel-furman/sft-demos/blob/main/src/sft/llama-2/sft_Llama_2_13B_Instruct_v0_2_peft.ipynb)
-## Evaluation Results
-| Metric                | Value |
-|-----------------------|-------|
-| MMLU (5-shot)         | Coming |
-| ARC (25-shot)         | Coming |
-| HellaSwag (10-shot)   | Coming |
-| TruthfulQA (0-shot)   | Coming |
-| Avg.                  | Coming |
-We use Eleuther.AI's [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above, the same version as Hugging Face's [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+## Open LLM Leaderboard Evaluation Results
+Detailed results can be found [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+*Note*: The below values do not apply the [prompt formatting](https://huggingface.co/dfurman/Llama-2-13B-Instruct-v0.2#prompt-format) used to finetune the model. An action item for future development is to run these evaluation benchmarks with the formatting applied, which should increase the scores.
+| Metric                | Value                     |
+|-----------------------|---------------------------|
+| Avg.                  | 48.3   |
+| ARC (25-shot)         | 60.58          |
+| HellaSwag (10-shot)   | 81.96    |
+| MMLU (5-shot)         | 55.46         |
+| TruthfulQA (0-shot)   | 45.71  |
+| Winogrande (5-shot)   | 77.82   |
+| GSM8K (5-shot)        | 9.33        |
+| DROP (3-shot)         | 7.22         |
 ## Basic Usage
@@ -254,16 +258,4 @@ dryanfurman at gmail
 - PEFT 0.6.3.dev0
-# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
-Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_dfurman__llama-2-13b-dolphin-peft)
-| Metric                | Value                     |
-|-----------------------|---------------------------|
-| Avg.                  | 17.2   |
-| ARC (25-shot)         | 22.7          |
-| HellaSwag (10-shot)   | 25.04    |
-| MMLU (5-shot)         | 23.12         |
-| TruthfulQA (0-shot)   | 0.0   |
-| Winogrande (5-shot)   | 49.57   |
-| GSM8K (5-shot)        | 0.0        |
-| DROP (3-shot)         | 0.0         |
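
As a sanity check on the two leaderboard tables in this diff, the `Avg.` row appears to be the unweighted mean of the seven benchmark scores, rounded to one decimal place. The sketch below assumes that is how the average is computed; the dictionary and function names are illustrative and do not come from the README.

```python
# Assumption: "Avg." is the plain arithmetic mean of the seven benchmark
# scores, rounded to one decimal. Names below are hypothetical.

def mean_score(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the benchmark scores, rounded to 0.1."""
    return round(sum(scores.values()) / len(scores), 1)

# Scores from the table added in this commit.
new_scores = {
    "ARC (25-shot)": 60.58,
    "HellaSwag (10-shot)": 81.96,
    "MMLU (5-shot)": 55.46,
    "TruthfulQA (0-shot)": 45.71,
    "Winogrande (5-shot)": 77.82,
    "GSM8K (5-shot)": 9.33,
    "DROP (3-shot)": 7.22,
}

# Scores from the table removed in this commit.
old_scores = {
    "ARC (25-shot)": 22.7,
    "HellaSwag (10-shot)": 25.04,
    "MMLU (5-shot)": 23.12,
    "TruthfulQA (0-shot)": 0.0,
    "Winogrande (5-shot)": 49.57,
    "GSM8K (5-shot)": 0.0,
    "DROP (3-shot)": 0.0,
}

new_avg = mean_score(new_scores)
old_avg = mean_score(old_scores)
print(new_avg)  # 48.3
print(old_avg)  # 17.2
```

Both results match the `Avg.` values reported in the respective tables, which is consistent with the unweighted-mean assumption.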