TinyLlama-1.1B

The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. Training started on 2023-09-01.
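To give a feel for what this schedule implies, here is a rough back-of-the-envelope sketch of the sustained throughput needed. It assumes training runs continuously for the full 90 days with no downtime, which is a simplification; the variable names are ours and purely illustrative.

```python
# Rough estimate of the throughput required to hit the stated schedule.
# Assumption: uninterrupted training for the full 90 days (no restarts,
# no evaluation pauses), which real runs will not achieve exactly.
TOTAL_TOKENS = 3_000_000_000_000  # 3 trillion tokens
DAYS = 90
NUM_GPUS = 16

seconds = DAYS * 24 * 60 * 60
cluster_tokens_per_sec = TOTAL_TOKENS / seconds
per_gpu_tokens_per_sec = cluster_tokens_per_sec / NUM_GPUS

print(f"Wall-clock budget: {seconds:,} s")
print(f"Cluster throughput needed: {cluster_tokens_per_sec:,.0f} tokens/s")
print(f"Per-GPU throughput needed: {per_gpu_tokens_per_sec:,.0f} tokens/s")
```

In other words, each GPU has to sustain roughly 24k tokens per second for the whole run.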

We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be used as a drop-in replacement in many open-source projects built on Llama. In addition, with only 1.1B parameters, TinyLlama is compact enough for applications with a restricted computation and memory budget.

Release Schedule

We will release intermediate checkpoints according to the schedule below. Some baseline models are included for comparison.

Date	HF Checkpoint	Tokens	Step	HellaSwag Acc_norm
Baseline	StableLM-Alpha-3B	800B	--	38.31
Baseline	Pythia-1B-intermediate-step-50k-105b	105B	50k	42.04
Baseline	Pythia-1B	300B	143k	47.16
2023-09-04	TinyLlama-1.1B-intermediate-step-50k-105b	105B	50k	43.50
2023-09-16	--	500B	--	--
2023-10-01	--	1T	--	--
2023-10-16	--	1.5T	--	--
2023-10-31	--	2T	--	--
2023-11-15	--	2.5T	--	--
2023-12-01	--	3T	--	--

TinyLlama has progressed well so far 🎉🎉: at 105B tokens it reaches 43.50 on HellaSwag (Acc_norm), ahead of Pythia-1B at the same token count (42.04).
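The Tokens and Step columns are tied together by the batch size of 2048 * 1024 tokens listed under Training Details. The sketch below converts between the two. It assumes the batch size stays constant for the whole run; the step counts it prints for future milestones are estimates derived from that assumption, not figures we have published, and the helper names are illustrative.

```python
# Convert between optimizer steps and tokens seen.
# Assumption: a constant batch of 1024 sequences of 2048 tokens each.
SEQ_LEN = 2048
SEQS_PER_BATCH = 1024
TOKENS_PER_STEP = SEQ_LEN * SEQS_PER_BATCH  # 2,097,152 tokens per step


def tokens_at_step(step: int) -> int:
    """Total tokens consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP


def steps_for_tokens(tokens: int) -> int:
    """Smallest number of steps that consumes at least `tokens` tokens."""
    return -(-tokens // TOKENS_PER_STEP)  # integer ceiling division


# The 50k-step checkpoint corresponds to roughly 105B tokens.
print(f"50k steps -> {tokens_at_step(50_000) / 1e9:.1f}B tokens")

# Estimated step counts for the scheduled token milestones.
for milestone in (500e9, 1e12, 1.5e12, 2e12, 2.5e12, 3e12):
    print(f"{milestone / 1e12:.1f}T tokens -> ~{steps_for_tokens(int(milestone)):,} steps")
```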

Meanwhile, you can track the live cross-entropy loss here.

Training Details

Below are some details of our training setup:

Setting	Description
Parameters	1.1B
Attention Variant	Grouped Query Attention
Model Size	Layers: 22, Heads: 32, Query Groups: 4, Embedding Size: 2048, Intermediate Size (SwiGLU): 5632
Sequence Length	2048
Batch Size	2 million tokens (2048 * 1024)
Learning Rate	4e-4
Learning Rate Schedule	Cosine with 2000 warmup steps
Training Data	SlimPajama & Starcoderdata
Data Preprocessing	Excluded GitHub subset of SlimPajama; Sampled all code from Starcoderdata
Combined Dataset Size	1 trillion tokens
Total Tokens During Training	3 trillion (3 epochs/1430k steps)
Natural Language to Code Ratio	7:3
Hardware	16 A100-40G GPUs
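
As a sanity check on the 1.1B figure, the parameter count can be reconstructed from the model size settings above. This is only a sketch under a few assumptions drawn from the Llama 2 design rather than from this table: a 32,000-entry vocabulary, untied input and output embeddings, no bias terms, and RMSNorm layers with a single weight vector each. The function name is illustrative.

```python
# Estimate the parameter count from the hyperparameters in the table.
# Assumptions (Llama 2 conventions, not stated in the table):
#   - vocabulary size of 32,000
#   - separate (untied) input embedding and output projection
#   - no bias terms; each RMSNorm holds one weight vector of size `hidden`


def count_params(layers=22, heads=32, query_groups=4, hidden=2048,
                 intermediate=5632, vocab=32_000):
    head_dim = hidden // heads
    kv_dim = query_groups * head_dim  # grouped-query attention shrinks K and V

    attention = (
        hidden * hidden    # query projection
        + hidden * kv_dim  # key projection (shared across each query group)
        + hidden * kv_dim  # value projection
        + hidden * hidden  # output projection
    )
    mlp = 3 * hidden * intermediate  # SwiGLU: gate, up, and down projections
    norms = 2 * hidden               # pre-attention and pre-MLP RMSNorm
    per_layer = attention + mlp + norms

    embeddings = 2 * vocab * hidden  # input embedding + output head
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm


total = count_params()
print(f"Estimated parameters: {total:,} (~{total / 1e9:.2f}B)")

# For comparison: the same model with full multi-head attention (32 KV heads).
print(f"Without GQA: {count_params(query_groups=32):,}")
```

With 4 query groups, the key and value projections are 8x smaller than they would be with one key/value head per attention head, which trims parameters and, more importantly for inference, the size of the KV cache.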