Model Card for Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0:

Model Details:

Model Description:

Fine-tuned from Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0 on teknium/openhermes.
We pruned the 4 layers of meta-llama/Meta-Llama-3.1-8B that had the least impact on model performance, following the paper The Unreasonable Ineffectiveness of the Deeper Layers.
The resulting model has 1.09B fewer parameters than the foundation model, which reduces memory use, speeds up training, and lowers inference latency.
We then recovered part of the performance lost to pruning by fine-tuning (MMLU-Pro 0-shot rose from 0.2642 to 0.3120). This step is called healing the pruned model.
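
The layer-selection idea from the cited paper can be illustrated with a small sketch: for every block of n consecutive layers, compare the hidden state entering the block with the hidden state leaving it, and prune the block where the two are most similar (smallest angular distance). This is a simplified illustration, not the code used to build this model: the function names and toy vectors are hypothetical, and a single vector per layer stands in for representations averaged over a dataset.

```python
import math

def angular_distance(x, y):
    """Angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi

def pick_block_to_prune(hidden_states, n):
    """hidden_states[l] is the state entering layer l; the last entry is the final output.
    Returns the start index of the n-layer block whose input and output are most similar."""
    num_layers = len(hidden_states) - 1
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n])
        for l in range(num_layers - n + 1)
    ]
    return min(range(len(distances)), key=distances.__getitem__)

def drop_layers(layers, start, n):
    """Return the layer list without layers start .. start + n - 1."""
    return layers[:start] + layers[start + n:]

# Toy example (hypothetical data): 6 layers, 2-d states; states 2..4 barely change.
toy_states = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.0, 1.0],
              [0.01, 1.0], [-0.7, 0.7], [-1.0, 0.0]]
start = pick_block_to_prune(toy_states, n=2)
kept = drop_layers(list(range(6)), start, n=2)
print(start, kept)  # 2 [0, 1, 4, 5]
```

In this toy case the block starting at layer 2 changes the representation least, so layers 2 and 3 are dropped and the remaining layers are kept in order.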

Upcoming Work:

More healing through SFT/DPO/TPO to see whether we can get closer to the performance of meta-llama/Meta-Llama-3.1-8B (MMLU-Pro 0-shot of 0.3659 vs. 0.3120 for our model). (In progress)
Compare the exact same process applied to meta-llama/Llama-3.1-70B.

Training Details:

from unsloth import FastLanguageModel

# Attach LoRA adapters to the attention and MLP projections.
# `model` is assumed to have been loaded beforehand.
model = FastLanguageModel.get_peft_model(
    model,
    r = 4,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 4,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
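
As a rough guide to what this LoRA configuration implies: with use_rslora = False the adapter update is scaled by lora_alpha / r, and each adapted linear layer gains r * (in_features + out_features) trainable parameters. The projection shapes below are the published Llama 3.1 8B dimensions and are an assumption here; they were not read from this checkpoint.

```python
# LoRA settings from the configuration above.
r, lora_alpha = 4, 4

# (in_features, out_features) per target module; assumed Llama 3.1 8B shapes.
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

scaling = lora_alpha / r  # standard LoRA scaling when use_rslora = False
# Each adapter adds A (r x in_features) and B (out_features x r).
lora_params_per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())

print(scaling)                # 1.0
print(lora_params_per_layer)  # 327680
```

Multiplying the per-layer figure by the number of decoder layers in the pruned model gives the total number of trainable adapter parameters.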

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# `tokenizer`, `dataset`, and `max_seq_length` are assumed to be defined beforehand.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "completion",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 10,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 5000,
        learning_rate = 2e-4,
        # Use bf16 when the GPU supports it, otherwise fall back to fp16.
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs_4",
        push_to_hub = True,
        hub_always_push = True,
    ),
)
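
The batch settings above determine how many training sequences the run can consume. A minimal calculation, assuming a single GPU (the card does not state the device count, so num_devices is a hypothetical value):

```python
per_device_train_batch_size = 10
gradient_accumulation_steps = 4
max_steps = 5000
num_devices = 1  # assumption: the number of GPUs is not stated in the card

# Sequences contributing to each optimizer step.
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Upper bound on sequences processed over the whole run.
sequences_seen = sequences_per_step * max_steps

print(sequences_per_step)  # 40
print(sequences_seen)      # 200000
```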

Training Data:

teknium/openhermes

Memory and Latency Gains (Using Optimum-Benchmark):

Load Mode Memory Metrics

Model	Max Global VRAM (MB)	Max Process VRAM (MB)	Max Reserved VRAM (MB)	Max Allocated VRAM (MB)
Llama-3.1-8B	18521.98	16630.42	16196.30	16060.54
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0	16319.97	14428.41	13994.30	13879.42
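
The allocated-VRAM figures in the table can be turned into a relative saving and an implied parameter count. The conversion assumes 16-bit weights (2 bytes per parameter) and 1 MB = 10**6 bytes; neither is stated in the card.

```python
base_allocated_mb = 16060.54    # Llama-3.1-8B, Max Allocated VRAM
pruned_allocated_mb = 13879.42  # pruned and healed model, Max Allocated VRAM

saved_mb = base_allocated_mb - pruned_allocated_mb
saved_fraction = saved_mb / base_allocated_mb
# Assumptions: 2 bytes per parameter, 1 MB = 10**6 bytes.
implied_params_removed = saved_mb * 1e6 / 2

print(round(saved_mb, 2))                       # 2181.12
print(round(100 * saved_fraction, 1))           # 13.6
print(round(implied_params_removed / 1e9, 2))   # 1.09
```

Under these assumptions the saving of about 13.6% lines up with the 1.09B-parameter reduction stated in the model description.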

Inference Mode Latency Metrics

Model	Latency Mean (s)	Throughput (tokens/s)
Llama-3.1-8B	0.8104	38.2536
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0	0.5530	56.0570
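
For a quick read of the table, the relative latency reduction and throughput gain can be computed directly from the reported numbers:

```python
base_latency_s, pruned_latency_s = 0.8104, 0.5530
base_tps, pruned_tps = 38.2536, 56.0570

latency_reduction = 1 - pruned_latency_s / base_latency_s
throughput_gain = pruned_tps / base_tps

print(round(100 * latency_reduction, 1))  # 31.8
print(round(throughput_gain, 2))          # 1.47
```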

Evaluation:

(Foundation model) MMLU Pro 0-shot of meta-llama/Meta-Llama-3.1-8B: 0.3659
(Pruned model) MMLU Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers: 0.2642
(Healed model) MMLU Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0: 0.3120
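
The three scores above show how much of the pruning loss the healing step recovered. A short calculation using only the reported values:

```python
foundation, pruned, healed = 0.3659, 0.2642, 0.3120

loss_from_pruning = foundation - pruned   # drop caused by removing the layers
recovered = healed - pruned               # gain from healing
recovered_fraction = recovered / loss_from_pruning
remaining_gap = foundation - healed       # still missing vs. the foundation model

print(round(loss_from_pruning, 4))   # 0.1017
print(round(recovered_fraction, 2))  # 0.47
print(round(remaining_gap, 4))       # 0.0539
```

Healing therefore recovers a little under half of the MMLU-Pro 0-shot score lost to pruning.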

Evaluation Data and Process:

Dataset: TIGER-AI-Lab/MMLU-Pro.
Evaluation harness: Hugging Face Lighteval benchmarking repo.

Additional Benchmark Results

BoolQ 0-shot Benchmark Results

Model	Average Score	boolq (0 shots)	boolq contrastset (0 shots)
meta-llama/Meta-Llama-3.1-8B	0.569	0.569	0.568
Na0s/Llama-3.1-8B-Pruned-4-Layers	0.240	0.240	0.240
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0	0.833	0.834	0.831

BigBench 0-shot Benchmark Results

Model	Average Score	bigbench:causal_judgment (0 shots)	bigbench:date_understanding (0 shots)	bigbench:disambiguation_qa (0 shots)	bigbench:geometric_shapes (0 shots)	bigbench:logical_deduction (0 shots)	...
meta-llama/Meta-Llama-3.1-8B	0.351	0.574	0.499	0.302	0.164	0.208	...
Na0s/Llama-3.1-8B-Pruned-4-Layers	0.299	0.537	0.341	0.314	0.200	0.212	...
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0	0.364	0.579	0.610	0.407	0.264	0.208	...

Few-Shot Benchmark Results

Model	Average Score	arc:challenge (25 shots)	hellaswag (10 shots)	mmlu:abstract_algebra (5 shots)	mmlu:college_chemistry (5 shots)	mmlu:college_computer_science (5 shots)	mmlu:college_mathematics (5 shots)	...
meta-llama/Meta-Llama-3.1-8B	0.552	0.541	0.620	0.290	0.450	0.480	0.350	...
Na0s/Llama-3.1-8B-Pruned-4-Layers	0.516	0.462	0.549	0.290	0.440	0.460	0.280	...
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0	0.544	0.479	0.554	0.340	0.480	0.520	0.350	...

BigBench 3-shot Benchmark Results

Model	Average Score	bigbench:causal_judgment (3 shots)	bigbench:date_understanding (3 shots)	bigbench:disambiguation_qa (3 shots)	bigbench:geometric_shapes (3 shots)	bigbench:logical_deduction (3 shots)	...
meta-llama/Meta-Llama-3.1-8B	0.442	0.563	0.596	0.593	0.181	0.298	...
Na0s/Llama-3.1-8B-Pruned-4-Layers	0.420	0.563	0.642	0.574	0.217	0.258	...
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0	0.450	0.621	0.686	0.663	0.225	0.332	...

Overall Average Score

Model	Overall Average Score
meta-llama/Meta-Llama-3.1-8B	0.472
Na0s/Llama-3.1-8B-Pruned-4-Layers	0.364
Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0	0.513

Environmental Impact:

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
