---
language: en
tags:
- text-generation
- knowledge-distillation
- llama
- causal-lm
- openwebtext
- wikitext
- transfer-learning
model_name: DistilLLaMA
license: apache-2.0
datasets:
- openwebtext
- wikitext
parameter_count: 80M
metrics:
- cosine-similarity
- exact-match
- rouge
library_name: transformers
base_model: meta-llama/LLaMA-2-7B
---

### Overview

This model is a distilled version of LLaMA 2 with approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was used to transfer knowledge from a larger teacher model, Meta's LLaMA 2 7B, so that this smaller student model learns to mimic the teacher's behavior. This is the latest version of DistilLlama, trained for five days on two NVIDIA A100 80GB GPUs.

### Update

30 of 300 checkpoints were evaluated, and the one with the best semantic and factual accuracy is now the checkpoint published in this repository.

### Model Architecture

The architecture follows LLaMA 2, with the following parameters:

| Parameter                 | Value |
|---------------------------|-------|
| Hidden Dimension          | 512   |
| Intermediate Dimension    | 1536  |
| Max Position Embeddings   | 128   |
| Attention Heads           | 8     |
| Transformer Layers        | 16    |

### Evaluation Metrics

Illustrative sketches of these computations appear at the end of this card.

1. **Cosine Similarity using Word Embeddings**
   - **Description**: Measures semantic similarity by mapping words/phrases to vectors; two texts are similar when their embedding vectors point in the same direction.
   - **Equation**: Cosine Similarity = (A · B) / (||A|| ||B||)
   - **Example**: "The dog chased the cat." vs. "A canine pursued a feline." (high similarity)

2. **Exact Match (EM)**
   - **Description**: Checks whether critical keywords from the expected answer appear in the response.
   - **Example**:
     - Expected: "Paris"
     - Response: "The capital of France is Paris." (EM = 1)

3. **ROUGE Score**
   - **Description**: Measures the overlap of the longest common subsequence (LCS) between the reference text R and the candidate response C.
   - **Equations**:
     - Precision = LCS(R, C) / length of C
     - Recall = LCS(R, C) / length of R

### Model Evaluation Summary

| Model Name      | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|-----------------|--------------|--------------------|---------|------------------------|------------------|
| LLaMA-2-7B-HF   | 18215.61     | 1.84e-01           | 0.715   | 0.7257                 | 0.0821           |
| baby-llama-58m  | 57.20        | 2.73e-06           | 0.025   | 0.6556                 | 0.0097           |
| DistilLlama     | 77.12        | 7.79e-04           | 0.02    | 0.6623                 | 0.0115           |
| DistilLlamaV1   | 78.46        | 8.49e-04           | 0.065   | 0.6776                 | 0.0135           |

*Note: CodeCarbon was used to track carbon emissions. Evaluation ran on an Intel(R) Xeon(R) Gold 6448H with 32 cores and 80 GB of memory allocated.*

### GitHub Repositories

- **Training Repo**: [DistilLlama Training Repository](https://github.com/HenryHuang2/DistilLlama)
- **Evaluation Repo**: [Knowledge Distillation Evaluation Repository](https://github.com/svarnim1805/Knowledge-Distillation)

### Reference

    @misc{timiryasov2023babyllamaknowledgedistillation,
      title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
      author={Inar Timiryasov and Jean-Loup Tastet},
      year={2023},
      eprint={2308.02019},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2308.02019},
    }

*Note: This repository will be updated as training progresses. Last updated 2024-11-06.*
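### Usage (illustrative sketch)

This card does not include a loading example; the sketch below assumes the checkpoint loads as a standard `transformers` causal LM. The repo id `path/to/DistilLlama` is a placeholder, not this model's actual Hub path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "path/to/DistilLlama"  # placeholder: substitute this model's actual Hub path

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Greedy generation; note the 128-token context window from the architecture table.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```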
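### Reconstructing the Student Configuration (illustrative sketch)

The architecture table maps directly onto a `LlamaConfig`. The sketch below is an assumption-laden reconstruction: vocabulary size and embedding tying are not stated in this card, so the `transformers` defaults (32k vocabulary, untied embeddings) are used, which yields roughly 87M parameters rather than the quoted 80M.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Values taken from the Model Architecture table; everything else is a default.
config = LlamaConfig(
    hidden_size=512,
    intermediate_size=1536,
    max_position_embeddings=128,
    num_attention_heads=8,
    num_hidden_layers=16,
)
model = LlamaForCausalLM(config)  # randomly initialised student
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```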
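### Distillation Objective (illustrative sketch)

This card does not spell out the training loss. A common formulation for logit-based knowledge distillation, in the spirit of the Baby Llama paper cited above, combines a KL term against the teacher's temperature-softened distribution with the standard next-token cross-entropy. The values of `T` and `alpha` below are illustrative, not the ones used to train this model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq, vocab), aligned for next-token prediction
    teacher_logits: torch.Tensor,  # same shape, produced by the frozen teacher
    labels: torch.Tensor,          # (batch, seq) token ids, -100 where masked
    T: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    # Soft target: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard target: ordinary next-token cross-entropy against the data.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    return alpha * soft + (1.0 - alpha) * hard
```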
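### Metric Computation (illustrative sketch)

A minimal sketch of the three metrics defined above, using whitespace tokenisation and plain NumPy. The actual evaluation code lives in the linked evaluation repository and may differ, for example in tokenisation or in the embedding model used for cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine Similarity = (A . B) / (||A|| ||B||), applied to embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_match(expected_keyword: str, response: str) -> int:
    # EM = 1 if the critical keyword appears in the response, else 0.
    return int(expected_keyword.lower() in response.lower())

def lcs_length(r: list[str], c: list[str]) -> int:
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i, tok_r in enumerate(r, 1):
        for j, tok_c in enumerate(c, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_r == tok_c else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference: str, candidate: str) -> dict:
    # Precision = LCS(R, C) / length of C; Recall = LCS(R, C) / length of R.
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision, recall = lcs / len(c), lcs / len(r)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(exact_match("Paris", "The capital of France is Paris."))  # -> 1
print(rouge_l("The dog chased the cat.", "The dog chased a feline."))
```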