---
language: en
tags:
- text-generation
- knowledge-distillation
- llama
- causal-lm
- openwebtext
- wikitext
- transfer-learning
model_name: DistilLLaMA
license: apache-2.0
datasets:
- openwebtext
- wikitext
parameter_count: 80M
metrics:
- cosine-similarity
- exact-match
- rouge
library_name: transformers
base_model: meta-llama/LLaMA-2-7B
---

### Overview

This model is a distilled version of LLaMA 2 with approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was used to transfer knowledge from a larger teacher model, Meta's LLaMA 2 7B, so that this smaller student model learns to mimic the teacher's behavior. This is the latest version of DistilLlama, trained for five days on two NVIDIA A100 80GB GPUs.

### Update

30 of 300 checkpoints were evaluated, and the one with the best semantic and factual accuracy is now the checkpoint published in this repository.

### Model Architecture

The architecture follows LLaMA 2, with the following parameters:

| Parameter                 | Value |
|---------------------------|-------|
| Hidden Dimension          | 512   |
| Intermediate Dimension    | 1536  |
| Max Position Embeddings   | 128   |
| Attention Heads           | 8     |
| Transformer Layers        | 16    |

### Evaluation Metrics

Illustrative sketches of these computations appear at the end of this card.

1. **Cosine Similarity using Word Embeddings**
   - **Description**: Measures semantic similarity by mapping words/phrases to vectors; two texts are similar when their embedding vectors point in the same direction.
   - **Equation**: Cosine Similarity = (A · B) / (||A|| ||B||)
   - **Example**: "The dog chased the cat." vs. "A canine pursued a feline." (high similarity)

2. **Exact Match (EM)**
   - **Description**: Checks whether critical keywords from the expected answer appear in the response.
   - **Example**:
     - Expected: "Paris"
     - Response: "The capital of France is Paris." (EM = 1)

3. **ROUGE Score**
   - **Description**: Measures the overlap of the longest common subsequence (LCS) between the reference text R and the candidate response C.
   - **Equations**:
     - Precision = LCS(R, C) / length of C
     - Recall = LCS(R, C) / length of R

### Model Evaluation Summary

| Model Name      | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|-----------------|--------------|--------------------|---------|------------------------|------------------|
| LLaMA-2-7B-HF   | 18215.61     | 1.84e-01           | 0.715   | 0.7257                 | 0.0821           |
| baby-llama-58m  | 57.20        | 2.73e-06           | 0.025   | 0.6556                 | 0.0097           |
| DistilLlama     | 77.12        | 7.79e-04           | 0.02    | 0.6623                 | 0.0115           |
| DistilLlamaV1   | 78.46        | 8.49e-04           | 0.065   | 0.6776                 | 0.0135           |

*Note: CodeCarbon was used to track carbon emissions. Evaluation ran on an Intel(R) Xeon(R) Gold 6448H with 32 cores and 80 GB of memory allocated.*

### GitHub Repositories

- **Training Repo**: [DistilLlama Training Repository](https://github.com/HenryHuang2/DistilLlama)
- **Evaluation Repo**: [Knowledge Distillation Evaluation Repository](https://github.com/svarnim1805/Knowledge-Distillation)

### Reference

    @misc{timiryasov2023babyllamaknowledgedistillation,
      title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
      author={Inar Timiryasov and Jean-Loup Tastet},
      year={2023},
      eprint={2308.02019},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2308.02019},
    }

*Note: This repository will be updated as training progresses. Last updated 2024-11-06.*
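### Usage (illustrative sketch)

This card does not include a loading example; the sketch below assumes the checkpoint loads as a standard `transformers` causal LM. The repo id `path/to/DistilLlama` is a placeholder, not this model's actual Hub path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "path/to/DistilLlama"  # placeholder: substitute this model's actual Hub path

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Greedy generation; note the 128-token context window from the architecture table.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```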
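### Reconstructing the Student Configuration (illustrative sketch)

The architecture table maps directly onto a `LlamaConfig`. The sketch below is an assumption-laden reconstruction: vocabulary size and embedding tying are not stated in this card, so the `transformers` defaults (32k vocabulary, untied embeddings) are used, which yields roughly 87M parameters rather than the quoted 80M.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Values taken from the Model Architecture table; everything else is a default.
config = LlamaConfig(
    hidden_size=512,
    intermediate_size=1536,
    max_position_embeddings=128,
    num_attention_heads=8,
    num_hidden_layers=16,
)
model = LlamaForCausalLM(config)  # randomly initialised student
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```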
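### Distillation Objective (illustrative sketch)

This card does not spell out the training loss. A common formulation for logit-based knowledge distillation, in the spirit of the Baby Llama paper cited above, combines a KL term against the teacher's temperature-softened distribution with the standard next-token cross-entropy. The values of `T` and `alpha` below are illustrative, not the ones used to train this model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq, vocab), aligned for next-token prediction
    teacher_logits: torch.Tensor,  # same shape, produced by the frozen teacher
    labels: torch.Tensor,          # (batch, seq) token ids, -100 where masked
    T: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    # Soft target: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard target: ordinary next-token cross-entropy against the data.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    return alpha * soft + (1.0 - alpha) * hard
```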
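### Metric Computation (illustrative sketch)

A minimal sketch of the three metrics defined above, using whitespace tokenisation and plain NumPy. The actual evaluation code lives in the linked evaluation repository and may differ, for example in tokenisation or in the embedding model used for cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine Similarity = (A . B) / (||A|| ||B||), applied to embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_match(expected_keyword: str, response: str) -> int:
    # EM = 1 if the critical keyword appears in the response, else 0.
    return int(expected_keyword.lower() in response.lower())

def lcs_length(r: list[str], c: list[str]) -> int:
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i, tok_r in enumerate(r, 1):
        for j, tok_c in enumerate(c, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_r == tok_c else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(r)][len(c)]

def rouge_l(reference: str, candidate: str) -> dict:
    # Precision = LCS(R, C) / length of C; Recall = LCS(R, C) / length of R.
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision, recall = lcs / len(c), lcs / len(r)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(exact_match("Paris", "The capital of France is Paris."))  # -> 1
print(rouge_l("The dog chased the cat.", "The dog chased a feline."))
```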