
Overview

This model is a distilled version of LLaMA 2 with approximately 80 million parameters, trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was used to transfer knowledge from a larger "teacher" model, Meta's 7B LLaMA 2, so that the smaller "student" model learns to mimic the teacher's behavior. This is the latest version of DistilLlama and was trained for 5 days on two Nvidia A100 80GB GPUs.
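
As a rough illustration of the distillation setup described above (not the exact training code for this model), the student is typically optimized against a weighted mix of the usual cross-entropy loss on the ground-truth tokens and a KL-divergence term between temperature-softened teacher and student logits. The temperature and weighting below are illustrative assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hard-label cross-entropy blended with a soft-label KL term.

    temperature and alpha are illustrative hyperparameters, not the
    values used to train DistilLlama.
    """
    # Standard next-token cross-entropy against the ground-truth labels
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )

    # KL divergence between temperature-softened teacher and student distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce + (1 - alpha) * kl
```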

Update

30 out of 300 checkpoints were examined, and the one with the best semantic and factual accuracy is now the version hosted in this repository.

Model Architecture

The architecture is based on LLaMA 2, with the following parameters:

| Parameter | Value |
|---|---|
| Hidden Dimension | 512 |
| Intermediate Dimension | 1536 |
| Max Positional Embeddings | 128 |
| Attention Heads | 8 |
| Transformer Layers | 16 |
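
For reference, a configuration with these dimensions can be expressed with Hugging Face transformers roughly as follows; the vocabulary size is an assumed placeholder, since it is not listed in the table:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Dimensions taken from the table above; vocab_size is an assumption
# (use the tokenizer's actual vocabulary size when reproducing the model).
config = LlamaConfig(
    hidden_size=512,
    intermediate_size=1536,
    max_position_embeddings=128,
    num_attention_heads=8,
    num_hidden_layers=16,
    vocab_size=32000,
)

model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # on the order of 80-90M parameters
```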

Evaluation Metrics

  1. Cosine Similarity using Word Embeddings

    • Description: Measures semantic similarity by mapping words/phrases to vectors.
    • Equation: Cosine Similarity = ( A • B ) / ( ||A|| ||B|| )
    • Example: "The dog chased the cat." vs. "A canine pursued a feline." (High similarity)
  2. Exact Match (EM)

    • Description: Checks if critical keywords are present.
    • Example:
      • Expected: "Paris"
      • Response: "The capital of France is Paris." (EM = 1)
  3. ROUGE Score

    • Description: Measures the overlap of the longest common subsequence (LCS) between the reference text R and the candidate response C (ROUGE-L). A minimal sketch of all three metrics follows this list.
    • Equation:
      • Precision = LCS(R, C) / Length of C
      • Recall = LCS(R, C) / Length of R
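
The sketch below shows how these three metrics can be computed; it uses simple whitespace tokenization and fixed example vectors for the embedding step, both of which are simplifying assumptions rather than the exact evaluation pipeline used here:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # (A . B) / (||A|| ||B||); a and b are embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_match(expected_keyword: str, response: str) -> int:
    # EM = 1 if the critical keyword appears in the response, else 0
    return int(expected_keyword.lower() in response.lower())

def lcs_length(ref_tokens, cand_tokens):
    # Dynamic-programming longest common subsequence length
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref_tokens[i] == cand_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(reference: str, candidate: str):
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision = lcs / len(c) if c else 0.0
    recall = lcs / len(r) if r else 0.0
    return precision, recall

# Example usage
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
print(exact_match("Paris", "The capital of France is Paris."))        # 1
print(rouge_l("the dog chased the cat", "the cat was chased by the dog"))
```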

Model Evaluation Summary

| Model Name | Duration (s) | Emissions (kgCO₂e) | Avg. EM | Avg. Cosine Similarity | Avg. ROUGE Score |
|---|---|---|---|---|---|
| LLaMA-2-7B-HF | 18215.61 | 1.84e-01 | 0.715 | 0.7257 | 0.0821 |
| baby-llama-58m | 57.20 | 2.73e-06 | 0.025 | 0.6556 | 0.0097 |
| DistilLlama | 77.12 | 7.79e-04 | 0.02 | 0.6623 | 0.0115 |
| DistilLlamaV1 | 78.46 | 8.49e-04 | 0.065 | 0.6776 | 0.0135 |

Note: CodeCarbon was used to track carbon emissions. The evaluation was run with 80GB of allocated memory and 32 cores on an Intel(R) Xeon(R) Gold 6448H.
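
As an illustration of how such measurements can be collected, CodeCarbon's EmissionsTracker can wrap the evaluation loop roughly as follows; run_evaluation() is a placeholder for the actual evaluation code:

```python
from codecarbon import EmissionsTracker

def run_evaluation():
    # Placeholder for the actual evaluation loop over the test prompts
    pass

tracker = EmissionsTracker()    # tracks energy use and estimates emissions
tracker.start()
try:
    run_evaluation()
finally:
    emissions = tracker.stop()  # estimated emissions in kgCO2e
print(f"Estimated emissions: {emissions} kgCO2e")
```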

GitHub Repositories

Reference

@misc{timiryasov2023babyllamaknowledgedistillation,
  title={Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty},
  author={Inar Timiryasov and Jean-Loup Tastet},
  year={2023},
  eprint={2308.02019},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2308.02019},
}

Note: The repository will be updated as training progresses. Last updated: 2024-11-06.

