language: en
tags:
- text-generation
- knowledge-distillation
- llama
- causal-lm
- openwebtext
- wikitext
- transfer-learning
model_name: DistilLLaMA
license: apache-2.0
datasets:
- openwebtext
- wikitext
parameter_count: 80M
metrics:
- cosine-similarity
- exact-match
- rouge
library_name: transformers
base_model: meta-llama/LLaMA-2-7B
Overview
This model is a distilled version of LLaMA 2 with approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was used to transfer knowledge from a larger "teacher" model, Meta's LLaMA 2 7B, so that this smaller "student" model learns to mimic the teacher's behavior. This is the latest version of DistilLlama, trained for 5 days on two NVIDIA A100 80GB GPUs.
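As background, logit-based knowledge distillation typically optimizes a blend of a soft-target loss against the teacher's output distribution and the usual next-token cross-entropy. The sketch below illustrates that standard objective; the temperature, loss weighting, and exact formulation used to train DistilLlama are assumptions for illustration.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Standard logit distillation: KL(teacher || student) on softened logits + hard-label CE.

    `temperature` and `alpha` are illustrative defaults, not the values used for DistilLlama.
    """
    # Soften both distributions before comparing them.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # Scale the KL term by T^2 to keep gradient magnitudes comparable across temperatures.
    kd_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Ordinary next-token cross-entropy against the ground-truth tokens.
    ce_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))

    return alpha * kd_loss + (1 - alpha) * ce_loss
```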
Update
30 out of 300 checkpoints were examined, and the one with the best semantic and factual accuracy is now the checkpoint published in this repository.
Model Architecture
The architecture is based on LLaMA 2, with the following parameters:
Parameter | Value |
---|---|
Hidden Dimension | 512 |
Intermediate Dimension | 1536 |
Max Positional Embeddings | 128 |
Attention Heads | 8 |
Transformer Layers | 16 |
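For reference, a configuration with these dimensions can be expressed with `transformers.LlamaConfig`, as in the sketch below; `vocab_size` (set to the LLaMA 2 default) and any field not listed in the table are assumptions rather than confirmed values for this checkpoint.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Dimensions taken from the table above; vocab_size is assumed (LLaMA 2 default).
config = LlamaConfig(
    hidden_size=512,
    intermediate_size=1536,
    max_position_embeddings=128,
    num_attention_heads=8,
    num_hidden_layers=16,
    vocab_size=32000,
)
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # ~87M with untied embeddings, i.e. roughly the stated 80M scale
```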
Evaluation Metrics
Cosine Similarity using Word Embeddings
- Description: Measures semantic similarity by mapping words/phrases to vectors.
- Equation: Cosine Similarity = ( A • B ) / ( ||A|| ||B|| )
- Example: "The dog chased the cat." vs. "A canine pursued a feline." (High similarity)
Exact Match (EM)
- Description: Checks if critical keywords are present.
- Example:
- Expected: "Paris"
- Response: "The capital of France is Paris." (EM = 1)
ROUGE Score
- Description: Measures overlap via the longest common subsequence (LCS) between the reference text R and the candidate response text C (i.e., ROUGE-L).
- Equation:
  - Precision = LCS(R, C) / length of C
  - Recall = LCS(R, C) / length of R
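A short ROUGE-L sketch following the LCS definition above; splitting on whitespace is an assumed tokenization, not necessarily the one used in the evaluation:

```python
def lcs_length(ref_tokens, cand_tokens):
    """Length of the longest common subsequence, via dynamic programming."""
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(reference: str, candidate: str):
    """Return (precision, recall) based on the LCS between reference R and candidate C."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    return precision, recall

print(rouge_l("the dog chased the cat", "a dog chased a cat"))  # (0.6, 0.6)
```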
Model Evaluation Summary
Model Name | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
---|---|---|---|---|---|
LLaMA-2-7B-HF | 18215.61 | 1.84e-01 | 0.715 | 0.7257 | 0.0821 |
baby-llama-58m | 57.20 | 2.73e-06 | 0.025 | 0.6556 | 0.0097 |
DistilLlama | 77.12 | 7.79e-04 | 0.02 | 0.6623 | 0.0115 |
DistilLlamaV1 | 78.46 | 8.49e-04 | 0.065 | 0.6776 | 0.0135 |
Note: CodeCarbon was used to track carbon emissions. The evaluation ran on an Intel(R) Xeon(R) Gold 6448H with 32 cores and 80 GB of allocated memory.
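For context, CodeCarbon's tracker is typically wrapped around the evaluation loop roughly as below; this is a generic sketch, not the exact evaluation script:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # writes emissions.csv in the working directory by default
tracker.start()
# ... run the model evaluation here ...
emissions_kg = tracker.stop()  # estimated kgCO2e for the tracked span
print(emissions_kg)
```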
GitHub Repositories
- Training Repo: DistilLlama Training Repository
- Evaluation Repo: Knowledge Distillation Evaluation Repository
Reference
    @misc{timiryasov2023babyllamaknowledgedistillation,
      title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
      author={Inar Timiryasov and Jean-Loup Tastet},
      year={2023},
      eprint={2308.02019},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2308.02019},
    }
Note: The repository will be updated as training progresses. Last update: 2024-11-06.