---
base_model:
  - cstr/llama3.1-8b-spaetzle-v85
  - cstr/llama3.1-8b-spaetzle-v86
  - cstr/llama3.1-8b-spaetzle-v74
tags:
  - merge
  - mergekit
  - lazymergekit
  - cstr/llama3.1-8b-spaetzle-v85
  - cstr/llama3.1-8b-spaetzle-v86
  - cstr/llama3.1-8b-spaetzle-v74
license: llama3
language:
  - en
  - de
---

# llama3.1-8b-spaetzle-v90

llama3.1-8b-spaetzle-v90 is a progressive merge built from earlier spaetzle merges; see the merge tree below.

## Evaluation

German EQ-Bench v2_de: 69.93 (171/171 parseable). English EQ-Bench v2: 77.88 (171/171 parseable).

### Open LLM Leaderboard Evaluation Results

Detailed results can be found here.

| Metric              | Value |
|---------------------|------:|
| Avg.                | 27.59 |
| IFEval (0-Shot)     | 73.56 |
| BBH (3-Shot)        | 32.76 |
| MATH Lvl 5 (4-Shot) | 13.37 |
| GPQA (0-shot)       |  4.36 |
| MuSR (0-shot)       | 11.15 |
| MMLU-PRO (5-shot)   | 30.34 |

| Model                    | AGIEval | TruthfulQA | Bigbench |
|--------------------------|--------:|-----------:|---------:|
| llama3.1-8b-spaetzle-v90 |   42.05 |       57.2 |    44.75 |

### AGIEval

| Task                           | Version | Metric   | Value | Stderr |
|--------------------------------|--------:|----------|------:|-------:|
| agieval_aqua_rat               |       0 | acc      | 24.02 | ± 2.69 |
|                                |         | acc_norm | 23.62 | ± 2.67 |
| agieval_logiqa_en              |       0 | acc      | 40.09 | ± 1.92 |
|                                |         | acc_norm | 39.78 | ± 1.92 |
| agieval_lsat_ar                |       0 | acc      | 22.17 | ± 2.75 |
|                                |         | acc_norm | 21.74 | ± 2.73 |
| agieval_lsat_lr                |       0 | acc      | 50.39 | ± 2.22 |
|                                |         | acc_norm | 45.29 | ± 2.21 |
| agieval_lsat_rc                |       0 | acc      | 64.31 | ± 2.93 |
|                                |         | acc_norm | 58.36 | ± 3.01 |
| agieval_sat_en                 |       0 | acc      | 81.07 | ± 2.74 |
|                                |         | acc_norm | 73.79 | ± 3.07 |
| agieval_sat_en_without_passage |       0 | acc      | 45.15 | ± 3.48 |
|                                |         | acc_norm | 38.83 | ± 3.40 |
| agieval_sat_math               |       0 | acc      | 40.91 | ± 3.32 |
|                                |         | acc_norm | 35.00 | ± 3.22 |

Average: 42.05%

### TruthfulQA

| Task          | Version | Metric | Value | Stderr |
|---------------|--------:|--------|------:|-------:|
| truthfulqa_mc |       1 | mc1    | 39.66 | ± 1.71 |
|               |         | mc2    | 57.20 | ± 1.51 |

Average: 57.2%

### Bigbench

| Task                                             | Version | Metric                | Value | Stderr |
|--------------------------------------------------|--------:|-----------------------|------:|-------:|
| bigbench_causal_judgement                        |       0 | multiple_choice_grade | 58.42 | ± 3.59 |
| bigbench_date_understanding                      |       0 | multiple_choice_grade | 70.46 | ± 2.38 |
| bigbench_disambiguation_qa                       |       0 | multiple_choice_grade | 31.40 | ± 2.89 |
| bigbench_geometric_shapes                        |       0 | multiple_choice_grade | 33.43 | ± 2.49 |
|                                                  |         | exact_str_match       |  0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects          |       0 | multiple_choice_grade | 30.00 | ± 2.05 |
| bigbench_logical_deduction_seven_objects         |       0 | multiple_choice_grade | 24.29 | ± 1.62 |
| bigbench_logical_deduction_three_objects         |       0 | multiple_choice_grade | 56.00 | ± 2.87 |
| bigbench_movie_recommendation                    |       0 | multiple_choice_grade | 38.20 | ± 2.18 |
| bigbench_navigate                                |       0 | multiple_choice_grade | 50.20 | ± 1.58 |
| bigbench_reasoning_about_colored_objects         |       0 | multiple_choice_grade | 69.50 | ± 1.03 |
| bigbench_ruin_names                              |       0 | multiple_choice_grade | 54.46 | ± 2.36 |
| bigbench_salient_translation_error_detection     |       0 | multiple_choice_grade | 32.77 | ± 1.49 |
| bigbench_snarks                                  |       0 | multiple_choice_grade | 65.19 | ± 3.55 |
| bigbench_sports_understanding                    |       0 | multiple_choice_grade | 50.30 | ± 1.59 |
| bigbench_temporal_sequences                      |       0 | multiple_choice_grade | 45.70 | ± 1.58 |
| bigbench_tracking_shuffled_objects_five_objects  |       0 | multiple_choice_grade | 22.08 | ± 1.17 |
| bigbench_tracking_shuffled_objects_seven_objects |       0 | multiple_choice_grade | 17.03 | ± 0.90 |
| bigbench_tracking_shuffled_objects_three_objects |       0 | multiple_choice_grade | 56.00 | ± 2.87 |

Average: 44.75%

## Merge tree

The merge tree involves the following models:

- NousResearch/Hermes-3-Llama-3.1-8B
- Undi95/Meta-Llama-3.1-8B-Claude
- Dampfinchen/Llama-3.1-8B-Ultra-Instruct
- VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct
- akjindal53244/Llama-3.1-Storm-8B
- nbeerbower/llama3.1-gutenberg-8B
- Undi95/Meta-Llama-3.1-8B-Claude
- DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1
- nbeerbower/llama-3-wissenschaft-8B-v2
- Azure99/blossom-v5-llama3-8b
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
- princeton-nlp/Llama-3-Instruct-8B-SimPO
- Locutusque/llama-3-neural-chat-v1-8b
- Locutusque/Llama-3-Orca-1.0-8B
- DiscoResearch/Llama3_DiscoLM_German_8b_v0.1_experimental
- seedboxai/Llama-3-Kafka-8B-v0.2
- VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
- nbeerbower/llama-3-wissenschaft-8B-v2
- mlabonne/Daredevil-8B-abliterated-dpomix

A number of steps were involved, among them slerp-merging only the middle layers to compensate for tokenizer / chat template differences; an illustration is given in the configurations below.

## 🧩 Configuration

The final merge step for this model was:

```yaml
models:
  - model: cstr/llama3.1-8b-spaetzle-v59
    # no parameters necessary for base model
  - model: cstr/llama3.1-8b-spaetzle-v85
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v86
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v74
    parameters:
      density: 0.65
      weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
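
For intuition about `dare_ties`: each non-base model's delta from the base is randomly sparsified so that roughly `density` (0.65 here) of its entries survive, the survivors are rescaled to keep the expected delta unchanged, and the resulting deltas are then sign-resolved and added to the base with the given `weight`s. The snippet below is a minimal illustrative sketch of the DARE drop-and-rescale step on toy tensors; it is not mergekit's implementation and omits the TIES sign election. Configs like the one above are normally run with mergekit's `mergekit-yaml` CLI.

```python
import torch

def dare_sparsify(delta: torch.Tensor, density: float) -> torch.Tensor:
    """Randomly keep ~`density` of the delta's entries and rescale the survivors
    by 1/density so the expected value of the delta is preserved (DARE)."""
    mask = torch.bernoulli(torch.full_like(delta, density))
    return delta * mask / density

# Toy example mirroring the config above: fine-tunes merged onto a base tensor
# with density=0.65 and weight=0.3 each (a real merge does this per weight tensor).
torch.manual_seed(0)
base = torch.randn(4, 4)
finetune_a = base + 0.1 * torch.randn(4, 4)
finetune_b = base + 0.1 * torch.randn(4, 4)

merged = base.clone()
for finetune, weight in [(finetune_a, 0.3), (finetune_b, 0.3)]:
    delta = finetune - base  # this fine-tune's task vector
    merged += weight * dare_sparsify(delta, density=0.65)
```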

Among the previous steps:

```yaml
models:
  - model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
  t:
    - value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
```
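
The `t` curve assigns a per-layer interpolation factor: 0 at the outermost layers (those stay with the base model, cstr/llama3.1-8b-spaetzle-v74, which avoids clashing tokenizer / chat-template behaviour at the embedding and output ends), rising to 0.7 in the middle of the stack. As a rough sketch of what a single SLERP step does to one pair of weight tensors, here is a small NumPy illustration; it is not mergekit's code.

```python
import numpy as np

def slerp(t: float, v0: np.ndarray, v1: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation: t=0 returns v0 (base), t=1 returns v1."""
    a, b = v0.ravel(), v1.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < eps:
        # Nearly parallel tensors: fall back to plain linear interpolation.
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

# Per-layer factors from the config: 0 keeps the base model's layer as-is,
# larger values blend towards Hermes-3-Llama-3.1-8B in the middle layers.
t_curve = [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
layer_a = np.random.randn(8)  # stand-in for a base-model weight tensor
layer_b = np.random.randn(8)  # stand-in for the other model's weight tensor
blended = slerp(t_curve[7], layer_a, layer_b)
```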

## 💻 Usage

Use the Llama 3 chat template, as usual. GGUF quants for use with llama.cpp and wrappers such as ollama are available at cstr/llama3.1-8b-spaetzle-v90-GGUF.
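
A minimal transformers sketch (assuming the merged tokenizer ships the standard Llama 3 chat template, which `apply_chat_template` picks up automatically):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cstr/llama3.1-8b-spaetzle-v90"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Was ist ein Spätzle?"},
]
# Format the conversation with the tokenizer's (Llama 3) chat template.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```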