|
--- |
|
base_model: |
|
- cstr/llama3.1-8b-spaetzle-v85 |
|
- cstr/llama3.1-8b-spaetzle-v86 |
|
- cstr/llama3.1-8b-spaetzle-v74 |
|
tags: |
|
- merge |
|
- mergekit |
|
- lazymergekit |
|
- cstr/llama3.1-8b-spaetzle-v85 |
|
- cstr/llama3.1-8b-spaetzle-v86 |
|
- cstr/llama3.1-8b-spaetzle-v74 |
|
license: llama3 |
|
language: |
|
- en |
|
- de |
|
--- |
|
|
|
# llama3.1-8b-spaetzle-v90 |
|
|
|
llama3.1-8b-spaetzle-v90 is a progressive merge of merges: it combines several earlier llama3.1-8b-spaetzle merges (v85, v86, and v74 on top of v59 as base), built with mergekit / LazyMergekit.
|
|
|
# Evaluation
|
|
|
EQ-Bench: 69.93 on the German v2_de version (171/171) and 77.88 on the English v2 version (171/171).
|
|
|
[Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_cstr__llama3.1-8b-spaetzle-v90).
|
|
|
| Metric |Value| |
|
|-------------------|----:| |
|
|Avg. |27.59| |
|
|IFEval (0-Shot) |73.56| |
|
|BBH (3-Shot) |32.76| |
|
|MATH Lvl 5 (4-Shot)|13.37| |
|
|GPQA (0-shot) | 4.36| |
|
|MuSR (0-shot) |11.15| |
|
|MMLU-PRO (5-shot) |30.34| |
|
|
|
| Model |AGIEval|TruthfulQA|Bigbench| |
|
|--------------------------------------------------------------------------------|------:|---------:|-------:| |
|
|[llama3.1-8b-spaetzle-v90](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90)| 42.05| 57.2| 44.75| |
|
|
|
### AGIEval |
|
| Task |Version| Metric |Value| |Stderr| |
|
|------------------------------|------:|--------|----:|---|-----:| |
|
|agieval_aqua_rat | 0|acc |24.02|± | 2.69| |
|
| | |acc_norm|23.62|± | 2.67| |
|
|agieval_logiqa_en | 0|acc |40.09|± | 1.92| |
|
| | |acc_norm|39.78|± | 1.92| |
|
|agieval_lsat_ar | 0|acc |22.17|± | 2.75| |
|
| | |acc_norm|21.74|± | 2.73| |
|
|agieval_lsat_lr | 0|acc |50.39|± | 2.22| |
|
| | |acc_norm|45.29|± | 2.21| |
|
|agieval_lsat_rc | 0|acc |64.31|± | 2.93| |
|
| | |acc_norm|58.36|± | 3.01| |
|
|agieval_sat_en | 0|acc |81.07|± | 2.74| |
|
| | |acc_norm|73.79|± | 3.07| |
|
|agieval_sat_en_without_passage| 0|acc |45.15|± | 3.48| |
|
| | |acc_norm|38.83|± | 3.40| |
|
|agieval_sat_math | 0|acc |40.91|± | 3.32| |
|
| | |acc_norm|35.00|± | 3.22| |
|
|
|
Average: 42.05% |
|
|
|
### TruthfulQA |
|
| Task |Version|Metric|Value| |Stderr| |
|
|-------------|------:|------|----:|---|-----:| |
|
|truthfulqa_mc| 1|mc1 |39.66|± | 1.71| |
|
| | |mc2 |57.20|± | 1.51| |
|
|
|
Average: 57.2% |
|
|
|
### Bigbench |
|
| Task |Version| Metric |Value| |Stderr| |
|
|------------------------------------------------|------:|---------------------|----:|---|-----:| |
|
|bigbench_causal_judgement | 0|multiple_choice_grade|58.42|± | 3.59| |
|
|bigbench_date_understanding | 0|multiple_choice_grade|70.46|± | 2.38| |
|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|31.40|± | 2.89| |
|
|bigbench_geometric_shapes | 0|multiple_choice_grade|33.43|± | 2.49| |
|
| | |exact_str_match | 0.00|± | 0.00| |
|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|30.00|± | 2.05| |
|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|24.29|± | 1.62| |
|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|56.00|± | 2.87| |
|
|bigbench_movie_recommendation | 0|multiple_choice_grade|38.20|± | 2.18| |
|
|bigbench_navigate | 0|multiple_choice_grade|50.20|± | 1.58| |
|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|69.50|± | 1.03| |
|
|bigbench_ruin_names | 0|multiple_choice_grade|54.46|± | 2.36| |
|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|32.77|± | 1.49| |
|
|bigbench_snarks | 0|multiple_choice_grade|65.19|± | 3.55| |
|
|bigbench_sports_understanding | 0|multiple_choice_grade|50.30|± | 1.59| |
|
|bigbench_temporal_sequences | 0|multiple_choice_grade|45.70|± | 1.58| |
|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.08|± | 1.17| |
|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.03|± | 0.90| |
|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|56.00|± | 2.87| |
|
|
|
Average: 44.75% |
|
|
|
# Merge tree
|
|
|
The merge tree involves the following models: |
|
|
|
- NousResearch/Hermes-3-Llama-3.1-8B |
|
- Undi95/Meta-Llama-3.1-8B-Claude |
|
- Dampfinchen/Llama-3.1-8B-Ultra-Instruct |
|
- VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct |
|
- akjindal53244/Llama-3.1-Storm-8B |
|
- nbeerbower/llama3.1-gutenberg-8B |
|
- Undi95/Meta-Llama-3.1-8B-Claude |
|
- DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1 |
|
- nbeerbower/llama-3-wissenschaft-8B-v2 |
|
- Azure99/blossom-v5-llama3-8b |
|
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct |
|
- princeton-nlp/Llama-3-Instruct-8B-SimPO |
|
- Locutusque/llama-3-neural-chat-v1-8b |
|
- Locutusque/Llama-3-Orca-1.0-8B |
|
- DiscoResearch/Llama3_DiscoLM_German_8b_v0.1_experimental |
|
- seedboxai/Llama-3-Kafka-8B-v0.2 |
|
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct |
|
- nbeerbower/llama-3-wissenschaft-8B-v2 |
|
- mlabonne/Daredevil-8B-abliterated-dpomix |
|
|
|
The merge was built in a number of steps, among them SLERP merges of only the middle layers to compensate for tokenizer and chat-template differences between the source models. An illustration is given below.
|
|
|
## 🧩 Configuration |
|
|
|
The final merge step used the following configuration:
|
|
|
```yaml |
|
models:
  - model: cstr/llama3.1-8b-spaetzle-v59
    # no parameters necessary for base model
  - model: cstr/llama3.1-8b-spaetzle-v85
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v86
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v74
    parameters:
      density: 0.65
      weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
|
``` |
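To run such a config, something along the following lines should work with mergekit's Python API. This is only a sketch following the LazyMergekit pattern: it assumes `pip install mergekit`, the configuration above saved as a hypothetical `config.yaml`, and option names that may vary between mergekit versions.

```python
# Sketch: execute the DARE-TIES config above with mergekit's Python API.
# Assumes the YAML is saved as config.yaml and mergekit is installed.
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("config.yaml", "r", encoding="utf-8") as f:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(f))

run_merge(
    merge_config,
    out_path="./llama3.1-8b-spaetzle-v90",  # hypothetical output directory
    options=MergeOptions(
        cuda=False,           # set True to merge on GPU
        copy_tokenizer=True,  # write tokenizer files into the output directory
        lazy_unpickle=True,   # reduce peak memory while loading shards
        low_cpu_memory=False,
    ),
)
```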
|
|
|
Among the earlier steps was, for example, this SLERP of NousResearch/Hermes-3-Llama-3.1-8B into v74, where the `t` curve blends only the middle layers (the zeros at both ends leave the outermost layers of the base model untouched):
|
```yaml |
|
models:
  - model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
  t:
    - value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
|
``` |
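As a rough illustration of what such a SLERP step does per tensor, here is a minimal sketch of spherical linear interpolation between two weight tensors. This is not mergekit's actual implementation; `numpy` arrays simply stand in for model weights.

```python
# Illustrative sketch of SLERP between two weight tensors: t = 0 returns the
# base tensor unchanged, larger t blends toward the other model, as in the
# per-layer curve above.
import numpy as np

def slerp(t: float, a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    a_flat, b_flat = a.ravel(), b.ravel()
    # Angle between the two tensors, treated as high-dimensional vectors.
    cos_omega = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat) + eps
    )
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < eps:
        # Nearly colinear tensors: fall back to plain linear interpolation.
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Example: blend a middle layer 70% toward the second model.
base_layer = np.random.randn(16, 16)
other_layer = np.random.randn(16, 16)
merged_layer = slerp(0.7, base_layer, other_layer)
```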
|
|
|
## 💻 Usage |
|
|
|
Use the Llama 3 chat template, as usual. GGUF quants for use with llama.cpp and wrappers such as ollama are available at [cstr/llama3.1-8b-spaetzle-v90-GGUF](https://huggingface.co/cstr/llama3.1-8b-spaetzle-v90-GGUF).
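For the full-precision model, a minimal Transformers inference sketch is shown below. It assumes a GPU with enough memory for the 8B model in bfloat16; the example prompt is only illustrative.

```python
# Minimal inference sketch with Hugging Face Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cstr/llama3.1-8b-spaetzle-v90"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Was sind Spätzle?"},
]
# apply_chat_template formats the conversation with the Llama 3 chat template.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```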
|
|
|
|