Model Card for DCLM-1B

DCLM-1B is a 1.4-billion-parameter language model trained on the DCLM-Baseline dataset, which was curated as part of the DataComp for Language Models (DCLM) benchmark. The model is intended to show how systematic data curation improves language model performance.

The instruction-tuned version of this model is available here: https://huggingface.co/TRI-ML/DCLM-1B-IT

Quickstart

First, install open_lm:

pip install git+https://github.com/mlfoundations/open_lm.git

Then you can load the model using HF's Auto classes as follows:

# Importing open_lm.hf registers the OpenLM model classes with the HF Auto classes.
from open_lm.hf import *
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TRI-ML/DCLM-1B")
model = AutoModelForCausalLM.from_pretrained("TRI-ML/DCLM-1B")

# Tokenize a prompt and sample a continuation of up to 50 new tokens.
inputs = tokenizer(["Machine learning is"], return_tensors="pt")
gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}
output = model.generate(inputs["input_ids"], **gen_kwargs)

# Decode the first (and only) generated sequence back to text.
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)

Evaluation

We evaluate DCLM-1B using the llm-foundry eval suite and compare it to recently released small models on key benchmarks. As described in the paper, Core is the average centered accuracy over 22 tasks (including HellaSwag and ARC-E), and Extended is the average centered accuracy over 53 tasks.
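
To make the Core and Extended metrics concrete, here is a minimal sketch of centered accuracy, assuming the usual definition in which a task's raw accuracy is rescaled so that random guessing maps to 0 and a perfect score maps to 1. The function names and the per-task chance baselines (0.25 for four-way multiple choice, 0.5 for two-way) are illustrative assumptions, not values read from the llm-foundry configuration. The raw accuracies are the HellaSwag, ARC (easy), and PIQA scores reported in this card.

```python
def centered_accuracy(accuracy: float, chance: float) -> float:
    """Rescale raw accuracy so that chance level maps to 0 and perfect to 1."""
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance baseline must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


def aggregate(results: dict[str, tuple[float, float]]) -> float:
    """Average centered accuracy over tasks given (raw accuracy, chance) pairs."""
    centered = [centered_accuracy(acc, chance) for acc, chance in results.values()]
    return sum(centered) / len(centered)


# Raw accuracies from this card; chance baselines are assumed for illustration.
example = {
    "HellaSwag": (0.7285, 0.25),
    "ARC (easy)": (0.7462, 0.25),
    "PIQA": (0.7829, 0.50),
}

for task, (acc, chance) in example.items():
    print(f"{task}: {centered_accuracy(acc, chance):.3f}")
print(f"mean over these three tasks: {aggregate(example):.3f}")
```

The three-task mean printed here is not the Core score. Core averages over all 22 tasks, each with its own chance baseline.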

Model	Params	Tokens	Open dataset?	Core	MMLU 5-shot	Extended
Open weights, closed datasets
Qwen2-1.5B	1.5B	7T	❌	42.1	56.4	32.4
Gemma-2B	2.5B	3T	❌	43.3	40.8	26.6
Open weights, open datasets
OLMo-1B	1.2B	3T	✅	29.7	26.0	16.1
SmolLM	1.7B	1T	✅	36.3	30.0	21.2
DCLM-1B	1.4B	4.3T	✅	45.2	47.5	28.1

Model Details

Size	Training Tokens	Layers	Hidden Size	Attention Heads	Context Length
1.4B	4.3T	24	2048	16	2048
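
As a quick sanity check on the table above, the per-head dimension and a rough non-embedding parameter count can be derived from the listed shape. The sketch below uses the common 12 * layers * hidden^2 rule of thumb for a standard Transformer block (attention plus a 4x-wide MLP). That block layout is an assumption, since this card does not spell out the exact OpenLM architecture, so the result is only an order-of-magnitude estimate.

```python
# Values taken from the Model Details table.
layers = 24
hidden_size = 2048
attention_heads = 16

# Per-head dimension follows directly from the table.
head_dim = hidden_size // attention_heads

# Rule-of-thumb estimate (assumption): each block holds about 12 * hidden^2
# weights (4 * hidden^2 for attention, 8 * hidden^2 for a 4x-wide MLP).
# Embeddings, norms, and biases are ignored.
approx_block_params = 12 * hidden_size**2
approx_non_embedding_params = layers * approx_block_params

print(f"head dimension: {head_dim}")
print(f"approx. non-embedding parameters: {approx_non_embedding_params / 1e9:.2f}B")
```

Under these assumptions the estimate lands near 1.2B, which is in the same range as the stated 1.4B once embedding matrices are added.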

Model Description

Developed by: DataComp for Language Models (DCLM) Team
Model type: Decoder-only Transformer language model
Language(s): English (primarily)
License: Apache 2.0
Contact: [email protected]
Date: July 2024

Model Sources

Repository: https://github.com/mlfoundations/dclm
Dataset: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
Paper: DataComp-LM: In search of the next generation of training sets for language models

Training Details

The model was trained using the following setup:

Architecture: Decoder-only Transformer
Framework: PyTorch with OpenLM
Optimizer: AdamW
Learning Rate: 1e-2 (peak)
Weight Decay: 1e-2
Batch Size: 2048 sequences
Sequence Length: 2048 tokens
Total Training Tokens: 4.3T
Hardware: Trained on H100 GPUs

We trained our 1.4B model for 4.3T tokens on DCLM-Baseline combined with the StarCoder and ProofPile2 datasets. We will update our paper soon with more training details.
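
The batch size, sequence length, and token budget listed above imply a rough number of optimizer steps. The following back-of-the-envelope sketch assumes that the batch size of 2048 sequences is the global batch per optimizer step and that every sequence is fully packed to 2048 tokens. The 4.3T figure is rounded, so the step count is approximate.

```python
# Values taken from the training setup list.
batch_size_sequences = 2048
sequence_length_tokens = 2048
total_training_tokens = 4_300_000_000_000  # "4.3T", rounded

# Tokens consumed per optimizer step (assumes a fully packed global batch).
tokens_per_step = batch_size_sequences * sequence_length_tokens

# Approximate number of optimizer steps implied by the token budget.
approx_steps = total_training_tokens // tokens_per_step

print(f"tokens per step: {tokens_per_step:,}")
print(f"approx. optimizer steps: {approx_steps:,}")
```

Under these assumptions, each step consumes roughly 4.2M tokens and the full run corresponds to about one million optimizer steps.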

Detailed Evaluation

Task	Score
AGI Eval LSAT AR	0.2652
AGI Eval LSAT LR	0.3314
AGI Eval LSAT RC	0.4179
AGI Eval SAT English	0.4709
AGI Eval SAT Math (CoT)	0.0318
AQuA (CoT)	0.0245
ARC (challenge)	0.4744
ARC (easy)	0.7462
BBQ	0.5151
BigBench Conceptual Combinations	0.5437
BigBench Conlang Translation	0.0793
BigBench CS Algorithms	0.4720
BigBench Dyck Languages	0.2210
BigBench Elementary Math QA	0.2598
BigBench Language Identification	0.3284
BigBench Logical Deduction	0.2473
BigBench Misconceptions	0.5662
BigBench Novel Concepts	0.5000
BigBench Operators	0.3476
BigBench QA Wikidata	0.6852
BigBench Repeat Copy Logic	0.1250
BigBench Strange Stories	0.6724
BigBench Strategy QA	0.5671
BigBench Understanding Fables	0.4603
BoolQ	0.7382
CommonSenseQA	0.6708
COPA	0.8200
CoQA	0.4314
Enterprise PII Classification	0.5246
GPQA Diamond	0.2424
GPQA	0.2500
GSM8K (CoT)	0.0629
HellaSwag	0.7285
HellaSwag (zero-shot)	0.7162
Jeopardy	0.4514
LAMBADA (OpenAI)	0.6992
LogiQA	0.3103
MathQA	0.2682
MMLU (few-shot)	0.4752
MMLU (zero-shot)	0.4175
OpenBookQA	0.4280
PIQA	0.7829
PubMedQA (labeled)	0.3790
Simple Arithmetic (no spaces)	0.0650
Simple Arithmetic (with spaces)	0.0700
SIQA	0.6868
SQuAD	0.5494
SVAMP (CoT)	0.2733
TriviaQA (small subset)	0.4133
Winogender (MC female)	0.4667
Winogender (MC male)	0.4000
Winograd	0.8608
Winogrande	0.6630

Limitations and Biases

While DCLM-1B demonstrates strong performance across a range of tasks, it's important to note:

The model may exhibit biases present in its training data, which is derived from web crawl data.
It has not undergone specific alignment or safety fine-tuning, so outputs should be used with caution.
Performance on tasks not included in the evaluation suite may vary.
The model's knowledge is limited to its training data cutoff date.

Ethical Considerations

Like all large language models, this model can generate harmful or biased content. It should not be used to make decisions about individuals, or in sensitive applications, without appropriate safeguards and human oversight.

Citation

If you use this model in your research, please cite:

@article{Li2024DataCompLM,
  title={DataComp-LM: In search of the next generation of training sets for language models},
  author={Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Gadre and Hritik Bansal and Etash Guha and Sedrick Keh and Kushal Arora and [... full author list]},
  journal={arXiv preprint arXiv:2406.11794},
  year={2024}
}
