BigQwen2.5-Echo-47B-Instruct
BigQwen2.5-Echo-47B-Instruct is a Qwen/Qwen2.5-32B-Instruct self-merge made with MergeKit.
Echo Merge
I've tried a more gradual approach with a distributed repetition pattern. Instead of replicating blocks of 8 or more layers, I'm replicating individual layers in these blocks:
- First 8 layers: No replication
- Next 8 layers: Replicate 2 layers (first one, middle one)
- Next 8 layers: Replicate 4 layers (1st, 3rd, 5th, 7th)
- Next 8 layers: Replicate 8 layers (all of them)
- Middle 8 layers: Replicate 8 layers (all of them)
- Next 8 layers: Replicate 4 layers (1st, 3rd, 5th, 7th)
- Next 8 layers: Replicate 2 layers (first one, middle one)
- Last 8 layers: No replication
I used this string to visualize it, where each 0 is an original layer and each 1 is a duplicated one (the order doesn't matter):
00000000 1000010000 100100100100 1010101010101010 1010101010101010 100100100100 1000010000 00000000
The main idea is that the difference between the input and output of a middle layer is quite small, so replicating a middle layer has only a small impact on the output. The additional layers are meant to increase the model's capacity without breaking the information flow. Breaking that flow is what often produces "insane" self-merges.
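The visualization string above can be reproduced from the per-block duplication counts. This is only a minimal sketch: `echo_pattern`, `block_size`, and `dup_counts` are hypothetical names chosen for illustration, and the even spacing of duplicates inside a block is an assumption read off the string itself.

```python
def echo_pattern(block_size=8, dup_counts=(0, 2, 4, 8, 8, 4, 2, 0)):
    """Build the 0/1 string: '0' = original layer, '1' = duplicated layer.

    dup_counts gives how many layers are duplicated in each block
    (assumed evenly spaced, starting at the first layer of the block).
    """
    groups = []
    for n in dup_counts:
        # Positions inside the block that get an extra copy.
        dups = set(range(0, block_size, block_size // n)) if n else set()
        # Emit a '1' next to every duplicated layer, then the layer's own '0'.
        groups.append("".join(("1" if i in dups else "") + "0" for i in range(block_size)))
    return " ".join(groups)


pattern = echo_pattern()
print(pattern)
# Number of layers implied by the pattern (spaces are only separators).
pattern_layers = len(pattern.replace(" ", ""))
print(pattern_layers)
```

Changing `dup_counts` is a quick way to sketch other distributions before writing a MergeKit config by hand.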
Evaluation
| Metric | BigQwen2.5-Echo-47B-Instruct | BigQwen2.5-52B-Instruct | Qwen2.5-32B-Instruct |
|---|---|---|---|
| Avg. | 30.31 | 37.42 | 36.17 |
| IFEval (0-Shot) | 73.57 | 79.29 | 83.46 |
| BBH (3-Shot) | 44.52 | 59.81 | 56.49 |
| MATH Lvl 5 (4-Shot) | 3.47 | 17.82 | 0 |
| GPQA (0-Shot) | 8.61 | 6.94 | 11.74 |
| MuSR (0-Shot) | 10.19 | 10.45 | 13.5 |
| MMLU-PRO (5-Shot) | 41.49 | 50.22 | 51.85 |
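The Avg. row is consistent with an unweighted mean of the six benchmark rows. That is an inference from the numbers, not something the table states, and the short check below only uses the scores copied from the table.

```python
# Scores per model, in table order:
# IFEval, BBH, MATH Lvl 5, GPQA, MuSR, MMLU-PRO
scores = {
    "BigQwen2.5-Echo-47B-Instruct": [73.57, 44.52, 3.47, 8.61, 10.19, 41.49],
    "BigQwen2.5-52B-Instruct": [79.29, 59.81, 17.82, 6.94, 10.45, 50.22],
    "Qwen2.5-32B-Instruct": [83.46, 56.49, 0, 11.74, 13.5, 51.85],
}

# Unweighted mean, rounded to two decimals like the table.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```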
Configuration
The following YAML configuration was used to produce this model:
slices:
  # First 8 layers: No replication
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [0, 8]
  # Next 8 layers: Replicate 2 layers
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [8, 9]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [8, 9]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [9, 13]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [13, 14]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [13, 14]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [14, 16]
  # Next 8 layers: Replicate 4 layers
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [16, 18]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [17, 19]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [18, 20]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [19, 21]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [20, 22]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [21, 23]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [22, 24]
  # Next 8 layers: Replicate all 8 layers
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [24, 25]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [24, 26]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [25, 27]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [26, 28]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [27, 29]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [28, 30]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [29, 31]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [30, 32]
  # Middle 8 layers: Replicate all 8 layers
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [32, 33]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [32, 34]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [33, 35]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [34, 36]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [35, 37]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [36, 38]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [37, 39]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [38, 40]
  # Next 8 layers: Replicate 4 layers
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [40, 42]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [41, 43]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [42, 44]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [43, 45]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [44, 46]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [45, 47]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [46, 48]
  # Next 8 layers: Replicate 2 layers
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [48, 49]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [48, 49]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [49, 53]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [53, 54]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [53, 54]
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [54, 56]
  # Last 8 layers: No replication
  - sources:
    - model: Qwen/Qwen2.5-32B-Instruct
      layer_range: [56, 64]
merge_method: passthrough
dtype: bfloat16
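To see what this configuration stacks, the `layer_range` values can be expanded and counted. The sketch below is not part of MergeKit: the `ranges` list is copied by hand from the config above, and it assumes each range is half-open (`[start, end)`), which is how the 64-layer base model maps onto `[0, 8]` through `[56, 64]`.

```python
from collections import Counter

# (start, end) pairs copied from the slices in the config, in order.
# Assumption: each range is half-open, i.e. [start, end).
ranges = [
    (0, 8),
    (8, 9), (8, 9), (9, 13), (13, 14), (13, 14), (14, 16),
    (16, 18), (17, 19), (18, 20), (19, 21), (20, 22), (21, 23), (22, 24),
    (24, 25), (24, 26), (25, 27), (26, 28), (27, 29), (28, 30), (29, 31), (30, 32),
    (32, 33), (32, 34), (33, 35), (34, 36), (35, 37), (36, 38), (37, 39), (38, 40),
    (40, 42), (41, 43), (42, 44), (43, 45), (44, 46), (45, 47), (46, 48),
    (48, 49), (48, 49), (49, 53), (53, 54), (53, 54), (54, 56),
    (56, 64),
]

# Expand every slice into the source-layer indices it contributes.
stacked = [layer for start, end in ranges for layer in range(start, end)]
counts = Counter(stacked)

total_layers = len(stacked)      # layers in the merged model
unique_layers = len(counts)      # distinct source layers used
duplicated = sorted(layer for layer, c in counts.items() if c > 1)

print(total_layers, unique_layers, len(duplicated))
print(duplicated)
```

Under that assumption the merged stack has 94 layers: all 64 source layers plus 30 repeats. The overlapping two-layer ranges repeat six layers in each "Replicate 4 layers" block and seven in each "Replicate all 8 layers" block, so the total is slightly different from the 92 positions in the visualization string.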
Usage
# Notebook cell: install dependencies (drop the leading "!" when running in a shell)
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "mlabonne/BigQwen2.5-Echo-47B-Instruct"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Format the conversation with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision, spread across the available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample a completion and print it
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])