---
base_model:
- cstr/llama3.1-8b-spaetzle-v85
- cstr/llama3.1-8b-spaetzle-v86
- cstr/llama3.1-8b-spaetzle-v74
tags:
- merge
- mergekit
- lazymergekit
- cstr/llama3.1-8b-spaetzle-v85
- cstr/llama3.1-8b-spaetzle-v86
- cstr/llama3.1-8b-spaetzle-v74
license: llama3
language:
- en
- de
---
# llama3.1-8b-spaetzle-v90
llama3.1-8b-spaetzle-v90 is a progressive merge of merges, built with mergekit from the models listed in the merge tree below.
# Evaluation
German EQ-Bench v2_de: 69.93; English v2: 77.88 (171/171 questions parsed in both runs).
[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__llama3.1-8b-spaetzle-v90)
| Metric |Value|
|-------------------|----:|
|Avg. |27.59|
|IFEval (0-Shot) |73.56|
|BBH (3-Shot) |32.76|
|MATH Lvl 5 (4-Shot)|13.37|
|GPQA (0-shot) | 4.36|
|MuSR (0-shot) |11.15|
|MMLU-PRO (5-shot) |30.34|
| Model |AGIEval|TruthfulQA|Bigbench|
|--------------------------------------------------------------------------------|------:|---------:|-------:|
|[llama3.1-8b-spaetzle-v90](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90)| 42.05| 57.2| 44.75|
### AGIEval
| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |24.02|± | 2.69|
| | |acc_norm|23.62|± | 2.67|
|agieval_logiqa_en | 0|acc |40.09|± | 1.92|
| | |acc_norm|39.78|± | 1.92|
|agieval_lsat_ar | 0|acc |22.17|± | 2.75|
| | |acc_norm|21.74|± | 2.73|
|agieval_lsat_lr | 0|acc |50.39|± | 2.22|
| | |acc_norm|45.29|± | 2.21|
|agieval_lsat_rc | 0|acc |64.31|± | 2.93|
| | |acc_norm|58.36|± | 3.01|
|agieval_sat_en | 0|acc |81.07|± | 2.74|
| | |acc_norm|73.79|± | 3.07|
|agieval_sat_en_without_passage| 0|acc |45.15|± | 3.48|
| | |acc_norm|38.83|± | 3.40|
|agieval_sat_math | 0|acc |40.91|± | 3.32|
| | |acc_norm|35.00|± | 3.22|
Average: 42.05%
### TruthfulQA
| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |39.66|± | 1.71|
| | |mc2 |57.20|± | 1.51|
Average: 57.2%
### Bigbench
| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|58.42|± | 3.59|
|bigbench_date_understanding | 0|multiple_choice_grade|70.46|± | 2.38|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|31.40|± | 2.89|
|bigbench_geometric_shapes | 0|multiple_choice_grade|33.43|± | 2.49|
| | |exact_str_match | 0.00|± | 0.00|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|30.00|± | 2.05|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|24.29|± | 1.62|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|56.00|± | 2.87|
|bigbench_movie_recommendation | 0|multiple_choice_grade|38.20|± | 2.18|
|bigbench_navigate | 0|multiple_choice_grade|50.20|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|69.50|± | 1.03|
|bigbench_ruin_names | 0|multiple_choice_grade|54.46|± | 2.36|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|32.77|± | 1.49|
|bigbench_snarks | 0|multiple_choice_grade|65.19|± | 3.55|
|bigbench_sports_understanding | 0|multiple_choice_grade|50.30|± | 1.59|
|bigbench_temporal_sequences | 0|multiple_choice_grade|45.70|± | 1.58|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.08|± | 1.17|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.03|± | 0.90|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|56.00|± | 2.87|
Average: 44.75%
# Merge tree
The merge tree involves the following models:
- NousResearch/Hermes-3-Llama-3.1-8B
- Undi95/Meta-Llama-3.1-8B-Claude
- Dampfinchen/Llama-3.1-8B-Ultra-Instruct
- VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct
- akjindal53244/Llama-3.1-Storm-8B
- nbeerbower/llama3.1-gutenberg-8B
- DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1
- nbeerbower/llama-3-wissenschaft-8B-v2
- Azure99/blossom-v5-llama3-8b
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
- princeton-nlp/Llama-3-Instruct-8B-SimPO
- Locutusque/llama-3-neural-chat-v1-8b
- Locutusque/Llama-3-Orca-1.0-8B
- DiscoResearch/Llama3_DiscoLM_German_8b_v0.1_experimental
- seedboxai/Llama-3-Kafka-8B-v0.2
- mlabonne/Daredevil-8B-abliterated-dpomix
There were a number of steps involved, among them slerp merges over only the middle layers to compensate for tokenizer / chat-template differences between the constituents; an illustration follows below.
## 🧩 Configuration
The final merge step used the following configuration:
```yaml
models:
- model: cstr/llama3.1-8b-spaetzle-v59
# no parameters necessary for base model
- model: cstr/llama3.1-8b-spaetzle-v85
parameters:
density: 0.65
weight: 0.3
- model: cstr/llama3.1-8b-spaetzle-v86
parameters:
density: 0.65
weight: 0.3
- model: cstr/llama3.1-8b-spaetzle-v74
parameters:
density: 0.65
weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
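For intuition, here is a toy PyTorch sketch of what `dare_ties` does per weight tensor with the `density` and `weight` values above. This illustrates the idea only, not mergekit's actual implementation: DARE randomly drops a fraction of each task vector (the delta against the base) and rescales the survivors, and TIES then keeps only the parameters whose sign agrees with the elected majority sign before summing.

```python
import torch

def dare_ties(base, finetunes, weights, density=0.65):
    """Toy per-tensor sketch of DARE-TIES (illustrative, not mergekit)."""
    deltas = []
    for ft, w in zip(finetunes, weights):
        delta = ft - base                              # task vector
        keep = torch.bernoulli(torch.full_like(delta, density))
        deltas.append(w * delta * keep / density)      # DARE: drop and rescale
    stacked = torch.stack(deltas)
    sign = stacked.sum(dim=0).sign()                   # TIES: elect a sign per parameter
    agree = stacked.sign() == sign                     # keep only agreeing deltas
    return base + (stacked * agree).sum(dim=0)

# Usage on random stand-in tensors:
base = torch.randn(4, 4)
merged = dare_ties(base, [torch.randn(4, 4) for _ in range(3)], [0.3, 0.3, 0.3])
```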
Among the previous steps was, for example, this middle-layers-only slerp merge:
```yaml
models:
- model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
t:
- value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
```
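The `t` schedule is zero for the first and last layers and peaks at 0.7 mid-stack, so the embedding-adjacent layers stay entirely on the base model (which is what absorbs the tokenizer / chat-template differences) while only the middle layers blend in the other model. Below is a minimal PyTorch sketch of slerp on a single tensor, as an illustration of the technique rather than mergekit's code:

```python
import torch

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two weight tensors."""
    a_flat, b_flat = a.flatten(), b.flatten()
    # Angle between the two weight vectors, measured on the unit sphere.
    cos_omega = (a_flat / (a_flat.norm() + eps)) @ (b_flat / (b_flat.norm() + eps))
    omega = torch.arccos(torch.clamp(cos_omega, -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < eps:                       # nearly parallel: fall back to lerp
        return (1.0 - t) * a + t * b
    out = (torch.sin((1.0 - t) * omega) / so) * a_flat \
        + (torch.sin(t * omega) / so) * b_flat
    return out.reshape(a.shape)

# t per layer: 0 at the ends, 0.7 in the middle, mirroring the config above.
t_curve = [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
```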
## 💻 Usage
Use the Llama 3 chat template, as usual for Llama 3.1 instruct models. GGUF quants for use with llama.cpp and wrappers such as ollama are available at [cstr/llama3.1-8b-spaetzle-v90-GGUF](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90-GGUF).
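For example, a minimal sketch with Hugging Face transformers (the chat template ships with the tokenizer, so `apply_chat_template` renders the Llama 3 format automatically; generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cstr/llama3.1-8b-spaetzle-v90"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Briefly explain what a model merge is."},
]
# The tokenizer's built-in chat template produces the Llama 3 prompt format.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```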