Commit 9109d1a by dfurman (1 parent: 4fbdb08)

Update README.md

Files changed (1)
  1. README.md +13 -21
README.md CHANGED
@@ -36,17 +36,21 @@ This model was built via parameter-efficient finetuning of the [meta-llama/Llama
 
 - **Repository:** [here](https://github.com/daniel-furman/sft-demos/blob/main/src/sft/llama-2/sft_Llama_2_13B_Instruct_v0_2_peft.ipynb)
 
-## Evaluation Results
+## Open LLM Leaderboard Evaluation Results
+Detailed results can be found [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
 
-| Metric | Value |
-|-----------------------|-------|
-| MMLU (5-shot) | Coming |
-| ARC (25-shot) | Coming |
-| HellaSwag (10-shot) | Coming |
-| TruthfulQA (0-shot) | Coming |
-| Avg. | Coming |
+*Note*: The below values do not apply the [prompt formatting](https://huggingface.co/dfurman/Llama-2-13B-Instruct-v0.2#prompt-format) used to finetune the model. An action item for future development is to run these evaluation benchmarks with the formatting applied, which should increase the scores.
 
-We use Eleuther.AI's [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above, the same version as Hugging Face's [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
+| Metric | Value |
+|-----------------------|---------------------------|
+| Avg. | 48.3 |
+| ARC (25-shot) | 60.58 |
+| HellaSwag (10-shot) | 81.96 |
+| MMLU (5-shot) | 55.46 |
+| TruthfulQA (0-shot) | 45.71 |
+| Winogrande (5-shot) | 77.82 |
+| GSM8K (5-shot) | 9.33 |
+| DROP (3-shot) | 7.22 |
 
 ## Basic Usage
 
@@ -254,16 +258,4 @@ dryanfurman at gmail
 
 
 - PEFT 0.6.3.dev0
-# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
-Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_dfurman__llama-2-13b-dolphin-peft)
 
-| Metric | Value |
-|-----------------------|---------------------------|
-| Avg. | 17.2 |
-| ARC (25-shot) | 22.7 |
-| HellaSwag (10-shot) | 25.04 |
-| MMLU (5-shot) | 23.12 |
-| TruthfulQA (0-shot) | 0.0 |
-| Winogrande (5-shot) | 49.57 |
-| GSM8K (5-shot) | 0.0 |
-| DROP (3-shot) | 0.0 |
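
The `Avg.` rows in both tables are consistent with an unweighted mean of the seven benchmark scores, rounded to one decimal place. The sketch below checks that arithmetic for the added and removed tables. The averaging rule is an assumption inferred from the numbers in this diff, not something the README states, and the names `new_scores`, `old_scores`, and `leaderboard_average` are hypothetical.

```python
from statistics import mean

# Scores copied from the table added in this commit.
new_scores = {
    "ARC (25-shot)": 60.58,
    "HellaSwag (10-shot)": 81.96,
    "MMLU (5-shot)": 55.46,
    "TruthfulQA (0-shot)": 45.71,
    "Winogrande (5-shot)": 77.82,
    "GSM8K (5-shot)": 9.33,
    "DROP (3-shot)": 7.22,
}

# Scores copied from the table removed in this commit.
old_scores = {
    "ARC (25-shot)": 22.7,
    "HellaSwag (10-shot)": 25.04,
    "MMLU (5-shot)": 23.12,
    "TruthfulQA (0-shot)": 0.0,
    "Winogrande (5-shot)": 49.57,
    "GSM8K (5-shot)": 0.0,
    "DROP (3-shot)": 0.0,
}


def leaderboard_average(scores):
    """Unweighted mean of all scores, rounded to one decimal (assumed rule)."""
    return round(mean(scores.values()), 1)


print(leaderboard_average(new_scores))  # 48.3, matches the added Avg. row
print(leaderboard_average(old_scores))  # 17.2, matches the removed Avg. row
```

Both results agree with the reported `Avg.` values, so each table is internally consistent under this assumption.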