smol_llama-220M-GQA-32k-linear

Experimental model intended to serve as a long-context draft model for speculative decoding.
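To illustrate the role a small draft model plays, here is a minimal sketch of greedy speculative decoding. It is not taken from this model's code or from any particular library: `target_next` and `draft_next` are hypothetical stand-ins for a large target model and a small draft model, each mapping a token sequence to its next greedy token. The sketch also omits the extra "bonus" token that real implementations usually take from the target when every drafted token is accepted.

```python
def greedy_decode(next_token, prompt, n_new):
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding with up to k drafted tokens per round."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2. The target model checks each proposed token. A real system
        #    scores all k positions in a single batched forward pass.
        for i, tok in enumerate(proposal):
            expected = target_next(seq + proposal[:i])
            if tok != expected:
                # Keep the agreeing prefix, then take the target's token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every drafted token matched the target.
            seq.extend(proposal)
    return seq


# Hypothetical toy "models" over integer tokens, for illustration only.
def target_next(seq):
    return (3 * seq[-1] + len(seq)) % 11


def draft_next(seq):
    tok = target_next(seq)
    # The draft disagrees with the target at every fifth position.
    return tok if len(seq) % 5 else (tok + 1) % 11


spec_out = speculative_decode(target_next, draft_next, [1, 2], 20)
plain_out = greedy_decode(target_next, [1, 2], 20)
```

Under greedy decoding the speculative output matches what the target model would produce on its own; the draft model only affects how many target passes are needed, which is why a draft model that stays accurate at long context is useful.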

Created from BEE-spoke-data/smol_llama-220M-GQA by further pretraining at a context length of 32768 on togethercomputer/RedPajama-Data-1T-Sample.

This variant uses linear RoPE scaling for context extension.
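As a rough sketch of what linear RoPE scaling does: each position index is divided by a constant factor before the rotary angles are computed, so positions in the extended window map back into the range the base model was trained on. The values below are assumptions for illustration: a base context of 2048 (consistent with 32768 / 16.0), a RoPE base of 10000, and a head dimension of 64. `rope_angles` is a hypothetical helper, not a function from any library.

```python
def rope_angles(position, head_dim=64, base=10000.0, scale=1.0):
    """Rotary angles for one position.

    Linear scaling divides the position by `scale` before applying the
    usual per-dimension frequencies base ** (-2i / head_dim).
    """
    scaled_pos = position / scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


ORIGINAL_CTX = 2048    # assumed context length of the base model
EXTENDED_CTX = 32768   # target context length
SCALE = EXTENDED_CTX / ORIGINAL_CTX  # 16.0

# The last position of the extended window, after scaling, falls just
# inside the original position range.
last_scaled_position = rope_angles(EXTENDED_CTX - 1, scale=SCALE)[0]

# Position 16 with scale 16 yields the same angles as position 1 unscaled.
same_as_unscaled = rope_angles(16, scale=SCALE) == rope_angles(1)
```

Because scaling compresses the distance between neighbouring positions, the model generally needs further training at the new scale to recover short-context quality.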

Wikitext Perplexity (64 rows) as evaluated by exllamav2:

Base Model
2048: 20.2193
4096: 102.6928
8192: 235.5210
16384: 390.7198
32768: 515.8053

32k - Linear Rope Scale 16.0
2048: 25.7148
4096: 23.4461
8192: 22.3326
16384: 21.6744
32768: 21.4317

32k - Rope Theta 1000000.0
2048: 20.2158
4096: 18.3868
8192: 17.5976
16384: 17.1462
32768: 16.6989
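For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below shows only that definition; `perplexity` is a hypothetical helper, and the exact exllamav2 evaluation settings behind the numbers above (beyond the 64 rows noted) are not given here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of each observed token."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among four options.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```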

