thomwolf HF staff committed on
Commit
ac80290
1 Parent(s): 443873d
Files changed (4)
  1. dist/bibliography.bib +6 -1
  2. dist/index.html +62 -52
  3. src/bibliography.bib +6 -1
  4. src/index.html +62 -52
dist/bibliography.bib CHANGED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  @inproceedings{barbaresi-2021-trafilatura,
2
  title = {Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction},
3
  author = "Barbaresi, Adrien",
@@ -28,7 +33,7 @@
28
  year={2016}
29
  }
30
  @misc{penedo2024datatrove,
31
- author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Wolf, Thomas and Sasko, Mario},
32
  title = {DataTrove: large scale data processing},
33
  year = {2024},
34
  publisher = {GitHub},
 
1
+ @article{radford2019language,
2
+ title={Language Models are Unsupervised Multitask Learners},
3
+ author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
4
+ year={2019}
5
+ }
6
  @inproceedings{barbaresi-2021-trafilatura,
7
  title = {Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction},
8
  author = "Barbaresi, Adrien",
 
33
  year={2016}
34
  }
35
  @misc{penedo2024datatrove,
36
+ author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Sasko, Mario and Wolf, Thomas},
37
  title = {DataTrove: large scale data processing},
38
  year = {2024},
39
  publisher = {GitHub},
dist/index.html CHANGED
@@ -179,39 +179,43 @@
179
  <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
180
  (<strong>15T gpt2 tokens, 44TB disk space</strong>) dataset of clean text sourced from the web for LLM pretraining. You can
181
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
182
- <p>The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Mixtral<d-cite bibtex-key="jiang2024mixtral"></d-cite> are not publicly available and very little is known about how they were created.</p>
183
- <p>🍷 FineWeb, a 15-trillion token dataset derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots, produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies.</p>
184
- <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a version of 🍷 FineWeb that was filtered for educational content, available in two sizes: <strong>1.3 trillion (very high quality) and 5.4 trillion (high quality) tokens</strong>. 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 
185
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
186
- <p>Both datasets are released under the permissive <strong><a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></strong></p>
187
 
188
  <p>As 🍷 FineWeb has gathered a lot of interest from the
189
- community, we decided to further explain the steps involved in creating it, our processing decisions and
190
- some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
191
  <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
192
- recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.</p>
193
 
194
  <h2>General considerations on web data</h2>
195
  <h3>Sourcing the data</h3>
196
- <p>A common question we see asked regarding web datasets used
197
- to train LLMs is “where do they even get all that data?” There are generally two options:</p>
198
  <ul>
199
- <li>you either crawl it yourself, like <a
200
- href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a
201
- href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do
202
  </li>
203
  </ul>
204
  <ul>
205
  <li>you use a public repository of crawled webpages, like the one maintained by
206
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
207
  </ul>
208
- <p>For 🍷 FineWeb, similarly to what was done for a large number
209
- of other public datasets, we used <a href="https://commoncrawl.org/">CommonCrawl</a> as a starting point.
210
- They have been crawling the web since 2007 (long before LLMs became widespread) and release a new dump usually
211
- every 1 or 2 months, which can be freely downloaded. </p>
212
- <p>As an example, their latest crawl (2024-18) contains 2.7
213
- billion web pages, totaling 386 TiB of uncompressed HTML text content (the size changes from dump to dump). There
214
- are 96 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
 
 
 
215
  <h3>Processing at scale</h3>
216
  <p>Given the sheer size of the data involved, one of the main
217
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
@@ -221,55 +225,61 @@
221
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a><d-cite bibtex-key="penedo2024datatrove"></d-cite>, an open-source data
222
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
223
  CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this <a
224
- href="https://github.com/huggingface/datatrove">library</a>.</p>
225
- <h3>What is clean, good data?</h3>
 
 
226
  <p>This is probably the main question to keep in mind when
227
- creating a dataset. In the context of large language model pretraining, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and often not a property of documents that can be easily perceived through direct observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
228
- <p>It is still common to train a model on a given corpus
229
- (wikipedia, or some other web dataset considered clean) and use it to check the perplexity on the dataset
230
- that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with performance on downstream
231
- tasks<d-cite bibtex-key="soldaini2024dolma"></d-cite>, and so another often used approach is to train small models (small because training models is
232
- expensive and time consuming, and we want to be able to quickly iterate) on a representative subset of our dataset and evaluate them on
233
- a set of evaluation tasks. As we are curating a dataset for pretraining a generalist LLM, it is important to
234
- choose a diverse set of tasks and try not to overfit to any one individual benchmark.</p>
235
- <p>Another way to evaluate different datasets would be to
236
- train a model on each one and have humans rate and compare their outputs (like on the <a
237
  href="https://chat.lmsys.org/">LMSYS Chatbot Arena</a>)<d-cite bibtex-key="chiang2024chatbot"></d-cite>. This would arguably provide the most
238
- reliable results in terms of representing real model usage, but getting ablation results this way is too
239
- expensive and slow. It also often requires that the models have undergone at least an instruction finetuning stage, as pretrained models have difficulty following instructions.<d-cite bibtex-key="ouyang2022training"></d-cite></p>
240
- <p>The approach we ultimately went with was to train small
241
- models and evaluate them on a set of benchmark tasks. We believe this is a reasonable proxy for the quality
242
- of the data used to train these models.</p>
243
  <h3>Ablations and evaluation setup</h3>
244
  <p>To be able to compare the impact of a given processing
245
- step, we would train 2 models, one where the data included the extra step and another where this step was
246
- ablated (cut/removed). These 2 models would have the same number of parameters, architecture, and be trained
247
- on an equal number of randomly sampled tokens from each step's data, for a single epoch, and with the same hyperparameters — the only difference would be in the
248
- training data. We would then evaluate each model on the same set of tasks and compare the average
249
  scores.</p>
250
- <p>Our ablation models were trained using <a
251
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
252
- INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. The models had 1.82B parameters, used the Llama
253
- architecture with a 2048 sequence length, and a global batch size of ~2 million tokens. For filtering
254
- ablations we mostly trained on ~28B tokens (which is roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
255
- model size).</p>
256
  <p>We evaluated the models using <a
257
- href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting
258
- benchmarks that would provide good signal at a relatively small scale (small models trained on only a few
259
- billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p>
260
  <ul>
261
  <li>small variance between runs trained on different samplings of the same
262
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
263
- resulting scores to have as little evaluation noise as possible
264
  </li>
265
  </ul>
266
  <ul>
267
  <li>performance increasing monotonically (or close) over a training run:
268
- ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease
269
  (which would be indicative of unreliable results at a small scale)
270
  </li>
271
  </ul>
272
- <p>We selected the following list of benchmarks:</p>
 
 
 
 
273
  <ul>
274
  <li>CommonSense QA<d-cite bibtex-key="talmor-etal-2019-commonsenseqa"></d-cite></li>
275
  <li>HellaSwag<d-cite bibtex-key="zellers-etal-2019-hellaswag"></d-cite></li>
@@ -281,7 +291,7 @@
281
  <li>MMLU<d-cite bibtex-key="hendrycks2021measuring"></d-cite></li>
282
  </ul>
283
  <p>To
284
- have results quickly we capped longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
285
  min on a single node of 8 GPUs - done in parallel to the training).</p>
286
  <aside>You can find the full list of tasks and prompts we used <a
287
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
 
179
  <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
180
  (<strong>15T gpt2 tokens, 44TB disk space</strong>) dataset of clean text sourced from the web for LLM pretraining. You can
181
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
182
+ <p>The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset.
183
+ However, the pretraining datasets for state-of-the-art open LLMs like Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Mixtral<d-cite bibtex-key="jiang2024mixtral"></d-cite> are not publicly available and very little is known about how they were created.</p>
184
+ <p>🍷 FineWeb, a 15-trillion token dataset derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots, produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies.</p>
185
+ <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a version of 🍷 FineWeb filtered for educational content, available in two sizes/filtering levels: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all token counts are measured with the GPT-2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
186
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
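 <p>As a small illustration of how such token counts can be obtained, here is a minimal sketch using the standard <code>gpt2</code> tokenizer from the Hugging Face Hub; it is purely illustrative and not the exact counting script used for FineWeb.</p>
 <pre><code class="language-python">
# Minimal sketch: counting GPT-2 tokens for a few documents.
# Purely illustrative; not the exact script used to measure FineWeb.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

documents = [
    "FineWeb is a large-scale dataset of web text for LLM pretraining.",
    "FineWeb-Edu is a subset filtered for educational content.",
]

total_tokens = sum(len(tokenizer.encode(doc)) for doc in documents)
print(f"{total_tokens} GPT-2 tokens across {len(documents)} documents")
 </code></pre>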
187
+ <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a>.</p>
188
 
189
  <p>As 🍷 FineWeb has gathered a lot of interest from the
190
+ community, we decided to explain in detail the steps involved in creating it, our processing decisions, and the
191
+ many lessons we learned along the way; hence the present (lengthy) technical report. Read on for all the juicy details on large text dataset creation!</p>
192
  <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
193
+ recipe (listing and explaining all of our design choices), and the process we followed to create its 📚 FineWeb-Edu subset.</p>
194
 
195
  <h2>General considerations on web data</h2>
196
  <h3>Sourcing the data</h3>
197
+ <p>A common question regarding the web datasets used
198
+ to train LLMs is “where do they even get all that data?” There are generally two options:</p>
199
  <ul>
200
+ <li>you either crawl it yourself, as companies like OpenAI or Anthropic (among others) do (see hints <a
201
+ href="https://platform.openai.com/docs/gptbot">here</a> and <a
202
+ href="https://darkvisitors.com/agents/claudebot">here</a>)
203
  </li>
204
  </ul>
205
  <ul>
206
  <li>you use a public repository of crawled webpages, like the one maintained by
207
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
208
  </ul>
209
+ <p>To build and filter 🍷 FineWeb, following what a number of LLM training teams have done in the past,
210
+ we used <a href="https://commoncrawl.org/">CommonCrawl</a> (CC) as a starting point.
211
+ The Common Crawl non-profit organization has been crawling the web since 2007 and
212
+ releases a new crawl, containing 200 to 300 TB of textual content obtained via automatic web crawling, usually
213
+ every 1 or 2 months. </p>
214
+ <p>As an example, the latest CC crawl (id 2024-18) contains 2.7
215
+ billion web pages, totaling 386 TiB of uncompressed HTML text content (note that the size varies from crawl to crawl).
216
+ Ninety-six crawls have been released since 2013, along with 3 dumps from 2008 to 2012, which are in a different (older) format.
217
+ <d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
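 <p>For reference, each crawl ships with plain-text index files listing its WARC files. The sketch below shows one way to enumerate them for the 2024-18 crawl; the <code>warc.paths.gz</code> layout reflects Common Crawl's published index files and is an assumption here, not the exact FineWeb download code.</p>
 <pre><code class="language-python">
# Sketch: listing the WARC files of a single Common Crawl crawl.
# The warc.paths.gz layout is Common Crawl's published convention; treat this
# as illustrative rather than the exact code used for FineWeb.
import gzip
import io
import urllib.request

CRAWL_ID = "CC-MAIN-2024-18"
paths_url = f"https://data.commoncrawl.org/crawl-data/{CRAWL_ID}/warc.paths.gz"

with urllib.request.urlopen(paths_url) as response:
    with gzip.open(io.BytesIO(response.read()), "rt") as f:
        warc_paths = [line.strip() for line in f if line.strip()]

print(f"{CRAWL_ID}: {len(warc_paths)} WARC files")
print("first file:", "https://data.commoncrawl.org/" + warc_paths[0])
 </code></pre>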
218
+
219
  <h3>Processing at scale</h3>
220
  <p>Given the sheer size of the data involved, one of the main
221
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
 
225
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a><d-cite bibtex-key="penedo2024datatrove"></d-cite>, an open-source data
226
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
227
  CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this <a
228
+ href="https://github.com/huggingface/datatrove">library</a>. In most cases, you'll find the exact same scripts we used in the <a
229
+ href="https://github.com/huggingface/datatrove"><code>datatrove</code></a> repository.</p>
230
+
231
+ <h3>What is good data?</h3>
232
  <p>This is probably the main question to keep in mind when
233
+ creating a dataset. In most contexts, and in particular in the context of large language model pretraining<d-footnote>Note that all our discussion in this report is focused on the specific field of web-scale datasets ("web-scale" typically meaning >100 billion tokens) used to pretrain large language models (by pretraining we mean the very first stage of training a model, starting from random weights). We don't claim to cover any other field of dataset creation, nor that the lessons or hypotheses we develop in this document extend to any field besides this specific one.</d-footnote>, "high quality" is not a very well-defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
234
+ <p>It is still common to train a model on a given corpus considered "clean"
235
+ (typically Wikipedia<d-footnote>Even though, as we mentioned above, the notion of "clean" is so ill-defined that it should probably not be seen as equivalent to Wikipedia-style text</d-footnote>) and use it to check the perplexity on the dataset
236
+ that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with improved performance on a set of downstream
237
+ tasks of interest<d-cite bibtex-key="soldaini2024dolma"></d-cite>, and as a result another commonly used approach is to train small models<d-footnote>"Small" in comparison to the standard sizes of today's LLMs, i.e. small in comparison to 7-70 billion parameters. In this work "small" means about 1-2 billion parameters.</d-footnote> on a representative subset of our dataset and evaluate them on
238
+ a set of evaluation tasks. Small models are used because training becomes increasingly
239
+ expensive and time-consuming as model size grows. In this second approach, it is important to
240
+ choose a diverse and representative set of evaluation tasks, and to try not to overfit to any individual benchmark, as that would risk hurting the generality of the LLM obtained after pretraining.</p>
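 <p>As a concrete (toy) illustration of the perplexity-based check mentioned above, one can score candidate documents with a reference model, here GPT-2 purely for illustration, and compare their perplexities; this is a sketch of the general idea, not the method used to build FineWeb.</p>
 <pre><code class="language-python">
# Toy sketch of perplexity scoring with a reference model (GPT-2 here, purely
# for illustration; not the method used to build FineWeb).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Exponentiated cross-entropy of the text under the reference model.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

candidates = [
    "The mitochondrion is the organelle that produces most of a cell's ATP.",
    "click here click here best deals best deals buy now buy now",
]
for doc in candidates:
    print(f"{perplexity(doc):8.1f}  {doc}")
 </code></pre>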
241
+ <p>Yet another way to compare different datasets would be to
242
+ train a model on each dataset and have humans rate and compare the generations of the models (like on the <a
243
  href="https://chat.lmsys.org/">LMSYS Chatbot Arena</a>)<d-cite bibtex-key="chiang2024chatbot"></d-cite>. This would arguably provide the most
244
+ reliable results in terms of representing real model usage, but getting ablation results this way is unfortunately
245
+ expensive and slow. It also often requires the models to have undergone an instruction-finetuning stage to acquire conversational capabilities, as pretrained models are not directly designed to follow instructions and are thus much more sensitive to prompt details.<d-cite bibtex-key="ouyang2022training"></d-cite></p>
246
+ <p>In this work, we went with the approach of training small
247
+ models and evaluating them on a set of "early-signal" benchmark tasks. We believe this is a reasonable proxy for the quality
248
+ of the data used to train these models, with the above-mentioned caveat around overfitting on the evaluation benchmarks.</p>
249
  <h3>Ablations and evaluation setup</h3>
250
  <p>To be able to compare the impact of a given processing
251
+ step, we typically train two models on two versions of the dataset: one version processed with the extra step under consideration, and another version with this step
252
+ ablated (cut/removed). Apart from the data, these two models are otherwise identical: they have the same number of parameters and architecture hyper-parameters, and are trained
253
+ on an equal number of randomly sampled tokens from each version of the data, for a single epoch; the only difference is thus the
254
+ training data. We then evaluate each model on the same set of tasks and compare average
255
  scores.</p>
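 <p>Schematically, the comparison then boils down to averaging per-task scores for the two models; the task names and numbers below are made up for illustration.</p>
 <pre><code class="language-python">
# Schematic ablation comparison: average per-task scores of the baseline model
# and of the model trained with the extra processing step (numbers are made up).
from statistics import mean

baseline_scores  = {"hellaswag": 0.41, "arc": 0.35, "mmlu": 0.27, "piqa": 0.68}
with_step_scores = {"hellaswag": 0.44, "arc": 0.37, "mmlu": 0.28, "piqa": 0.69}

baseline_avg  = mean(baseline_scores.values())
with_step_avg = mean(with_step_scores.values())

print(f"baseline:  {baseline_avg:.3f}")
print(f"with step: {with_step_avg:.3f}")
print(f"delta:     {with_step_avg - baseline_avg:+.3f}")
 </code></pre>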
256
+ <p>Our ablation models are trained using <a
257
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
258
+ INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Ablation models have 1.82B parameters (including embeddings), use the Llama
259
+ architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT-2 tokenizer mentioned above. For most
260
+ ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
261
+ model size). To confirm relative performance after several steps of filtering, we conducted longer training runs of 350 billion tokens, as mentioned further below.</p>
262
  <p>We evaluated the models using <a
263
+ href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
264
+ benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
265
+ billion" tokens). We generally used the following criteria to select these benchmarks among all the benchmarks available in <code>lighteval</code>:</p>
266
  <ul>
267
  <li>small variance between runs trained on different samplings of the same
268
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
269
+ resulting scores to have as little evaluation noise as possible, while still being sensitive to the larger data changes that we are actually ablating
270
  </li>
271
  </ul>
272
  <ul>
273
  <li>performance increasing monotonically (or close) over a training run:
274
+ ideally, as the number of seen tokens increases, the performance on a high-signal benchmark should not decrease
275
  (which would be indicative of unreliable results at a small scale)
276
  </li>
277
  </ul>
278
+ <ul>
279
+ <li>performance at least a few standard deviations above the random-guessing baseline: given our small ablation models and training budgets, we usually don't reach extremely high scores on any benchmark, but we want to make sure that the scores we get are above random noise.
280
+ </li>
281
+ </ul>
282
+ <p>After consideration, we selected the following list of benchmarks:</p>
283
  <ul>
284
  <li>CommonSense QA<d-cite bibtex-key="talmor-etal-2019-commonsenseqa"></d-cite></li>
285
  <li>HellaSwag<d-cite bibtex-key="zellers-etal-2019-hellaswag"></d-cite></li>
 
291
  <li>MMLU<d-cite bibtex-key="hendrycks2021measuring"></d-cite></li>
292
  </ul>
293
  <p>To
294
+ compute checkpoint evaluations within a constrained time budget, we capped the longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
295
  min on a single node of 8 GPUs - done in parallel to the training).</p>
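 <p>The selection criteria above can be checked directly from raw evaluation scores: low variance across reruns on different samplings, (near-)monotonic improvement over training, and scores clearly above the random baseline. The sketch below illustrates these checks on made-up numbers.</p>
 <pre><code class="language-python">
# Sketch of the benchmark-selection checks described above; all numbers are
# made up for illustration.
from statistics import mean, stdev

# Scores of one benchmark for 3 runs on different samplings of the same data,
# each evaluated at 4 training checkpoints.
runs = [
    [0.27, 0.29, 0.31, 0.33],
    [0.26, 0.30, 0.31, 0.34],
    [0.27, 0.28, 0.32, 0.33],
]
random_baseline = 0.25  # e.g. a 4-way multiple-choice task

final_scores = [run[-1] for run in runs]
avg_curve = [mean(step) for step in zip(*runs)]

low_variance = stdev(final_scores) < 0.01
monotonic = all(b >= a for a, b in zip(avg_curve, avg_curve[1:]))
above_random = mean(final_scores) - random_baseline > 3 * stdev(final_scores)

print(f"variance across samplings: {stdev(final_scores):.4f} (low: {low_variance})")
print(f"monotonic improvement:     {monotonic}")
print(f"clearly above random:      {above_random}")
 </code></pre>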
296
  <aside>You can find the full list of tasks and prompts we used <a
297
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
src/bibliography.bib CHANGED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  @inproceedings{barbaresi-2021-trafilatura,
2
  title = {Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction},
3
  author = "Barbaresi, Adrien",
@@ -28,7 +33,7 @@
28
  year={2016}
29
  }
30
  @misc{penedo2024datatrove,
31
- author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Wolf, Thomas and Sasko, Mario},
32
  title = {DataTrove: large scale data processing},
33
  year = {2024},
34
  publisher = {GitHub},
 
1
+ @article{radford2019language,
2
+ title={Language Models are Unsupervised Multitask Learners},
3
+ author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
4
+ year={2019}
5
+ }
6
  @inproceedings{barbaresi-2021-trafilatura,
7
  title = {Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction},
8
  author = "Barbaresi, Adrien",
 
33
  year={2016}
34
  }
35
  @misc{penedo2024datatrove,
36
+ author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Sasko, Mario and Wolf, Thomas},
37
  title = {DataTrove: large scale data processing},
38
  year = {2024},
39
  publisher = {GitHub},
src/index.html CHANGED
@@ -179,39 +179,43 @@
179
  <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
180
  (<strong>15T gpt2 tokens, 44TB disk space</strong>) dataset of clean text sourced from the web for LLM pretraining. You can
181
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
182
- <p>The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Mixtral<d-cite bibtex-key="jiang2024mixtral"></d-cite> are not publicly available and very little is known about how they were created.</p>
183
- <p>🍷 FineWeb, a 15-trillion token dataset derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots, produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies.</p>
184
- <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a version of 🍷 FineWeb that was filtered for educational content, available in two sizes: <strong>1.3 trillion (very high quality) and 5.4 trillion (high quality) tokens</strong>. 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 
185
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
186
- <p>Both datasets are released under the permissive <strong><a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></strong></p>
187
 
188
  <p>As 🍷 FineWeb has gathered a lot of interest from the
189
- community, we decided to further explain the steps involved in creating it, our processing decisions and
190
- some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
191
  <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
192
- recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.</p>
193
 
194
  <h2>General considerations on web data</h2>
195
  <h3>Sourcing the data</h3>
196
- <p>A common question we see asked regarding web datasets used
197
- to train LLMs is “where do they even get all that data?” There are generally two options:</p>
198
  <ul>
199
- <li>you either crawl it yourself, like <a
200
- href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a
201
- href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do
202
  </li>
203
  </ul>
204
  <ul>
205
  <li>you use a public repository of crawled webpages, like the one maintained by
206
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
207
  </ul>
208
- <p>For 🍷 FineWeb, similarly to what was done for a large number
209
- of other public datasets, we used <a href="https://commoncrawl.org/">CommonCrawl</a> as a starting point.
210
- They have been crawling the web since 2007 (long before LLMs became widespread) and release a new dump usually
211
- every 1 or 2 months, which can be freely downloaded. </p>
212
- <p>As an example, their latest crawl (2024-18) contains 2.7
213
- billion web pages, totaling 386 TiB of uncompressed HTML text content (the size changes from dump to dump). There
214
- are 96 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
 
 
 
215
  <h3>Processing at scale</h3>
216
  <p>Given the sheer size of the data involved, one of the main
217
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
@@ -221,55 +225,61 @@
221
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a><d-cite bibtex-key="penedo2024datatrove"></d-cite>, an open-source data
222
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
223
  CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this <a
224
- href="https://github.com/huggingface/datatrove">library</a>.</p>
225
- <h3>What is clean, good data?</h3>
 
 
226
  <p>This is probably the main question to keep in mind when
227
- creating a dataset. In the context of large language model pretraining, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and often not a property of documents that can be easily perceived through direct observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
228
- <p>It is still common to train a model on a given corpus
229
- (wikipedia, or some other web dataset considered clean) and use it to check the perplexity on the dataset
230
- that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with performance on downstream
231
- tasks<d-cite bibtex-key="soldaini2024dolma"></d-cite>, and so another often used approach is to train small models (small because training models is
232
- expensive and time consuming, and we want to be able to quickly iterate) on a representative subset of our dataset and evaluate them on
233
- a set of evaluation tasks. As we are curating a dataset for pretraining a generalist LLM, it is important to
234
- choose a diverse set of tasks and try not to overfit to any one individual benchmark.</p>
235
- <p>Another way to evaluate different datasets would be to
236
- train a model on each one and have humans rate and compare their outputs (like on the <a
237
  href="https://chat.lmsys.org/">LMSYS Chatbot Arena</a>)<d-cite bibtex-key="chiang2024chatbot"></d-cite>. This would arguably provide the most
238
- reliable results in terms of representing real model usage, but getting ablation results this way is too
239
- expensive and slow. It also often requires that the models have undergone at least an instruction finetuning stage, as pretrained models have difficulty following instructions.<d-cite bibtex-key="ouyang2022training"></d-cite></p>
240
- <p>The approach we ultimately went with was to train small
241
- models and evaluate them on a set of benchmark tasks. We believe this is a reasonable proxy for the quality
242
- of the data used to train these models.</p>
243
  <h3>Ablations and evaluation setup</h3>
244
  <p>To be able to compare the impact of a given processing
245
- step, we would train 2 models, one where the data included the extra step and another where this step was
246
- ablated (cut/removed). These 2 models would have the same number of parameters, architecture, and be trained
247
- on an equal number of randomly sampled tokens from each step's data, for a single epoch, and with the same hyperparameters — the only difference would be in the
248
- training data. We would then evaluate each model on the same set of tasks and compare the average
249
  scores.</p>
250
- <p>Our ablation models were trained using <a
251
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
252
- INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. The models had 1.82B parameters, used the Llama
253
- architecture with a 2048 sequence length, and a global batch size of ~2 million tokens. For filtering
254
- ablations we mostly trained on ~28B tokens (which is roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
255
- model size).</p>
256
  <p>We evaluated the models using <a
257
- href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting
258
- benchmarks that would provide good signal at a relatively small scale (small models trained on only a few
259
- billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p>
260
  <ul>
261
  <li>small variance between runs trained on different samplings of the same
262
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
263
- resulting scores to have as little evaluation noise as possible
264
  </li>
265
  </ul>
266
  <ul>
267
  <li>performance increasing monotonically (or close) over a training run:
268
- ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease
269
  (which would be indicative of unreliable results at a small scale)
270
  </li>
271
  </ul>
272
- <p>We selected the following list of benchmarks:</p>
 
 
 
 
273
  <ul>
274
  <li>CommonSense QA<d-cite bibtex-key="talmor-etal-2019-commonsenseqa"></d-cite></li>
275
  <li>HellaSwag<d-cite bibtex-key="zellers-etal-2019-hellaswag"></d-cite></li>
@@ -281,7 +291,7 @@
281
  <li>MMLU<d-cite bibtex-key="hendrycks2021measuring"></d-cite></li>
282
  </ul>
283
  <p>To
284
- have results quickly we capped longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
285
  min on a single node of 8 GPUs - done in parallel to the training).</p>
286
  <aside>You can find the full list of tasks and prompts we used <a
287
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
 
179
  <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
180
  (<strong>15T gpt2 tokens, 44TB disk space</strong>) dataset of clean text sourced from the web for LLM pretraining. You can
181
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
182
+ <p>The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset.
183
+ However, the pretraining datasets for state-of-the-art open LLMs like Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Mixtral<d-cite bibtex-key="jiang2024mixtral"></d-cite> are not publicly available and very little is known about how they were created.</p>
184
+ <p>🍷 FineWeb, a 15-trillion token dataset derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots, produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies.</p>
185
+ <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a version of 🍷 FineWeb filtered for educational content, available in two sizes/filtering levels: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all token counts are measured with the GPT-2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
186
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
187
+ <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a>.</p>
188
 
189
  <p>As 🍷 FineWeb has gathered a lot of interest from the
190
+ community, we decided to explain in detail the steps involved in creating it, our processing decisions, and the
191
+ many lessons we learned along the way; hence the present (lengthy) technical report. Read on for all the juicy details on large text dataset creation!</p>
192
  <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
193
+ recipe (listing and explaining all of our design choices), and the process we followed to create its 📚 FineWeb-Edu subset.</p>
194
 
195
  <h2>General considerations on web data</h2>
196
  <h3>Sourcing the data</h3>
197
+ <p>A common question regarding the web datasets used
198
+ to train LLMs is “where do they even get all that data?” There are generally two options:</p>
199
  <ul>
200
+ <li>you either crawl it yourself, as companies like OpenAI or Anthropic (among others) do (see hints <a
201
+ href="https://platform.openai.com/docs/gptbot">here</a> and <a
202
+ href="https://darkvisitors.com/agents/claudebot">here</a>)
203
  </li>
204
  </ul>
205
  <ul>
206
  <li>you use a public repository of crawled webpages, like the one maintained by
207
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
208
  </ul>
209
+ <p>To build and filter 🍷 FineWeb, following what a number of LLM training teams have done in the past,
210
+ we used <a href="https://commoncrawl.org/">CommonCrawl</a> (CC) as a starting point.
211
+ The Common Crawl non-profit organization has been crawling the web since 2007 and
212
+ releases a new crawl, containing 200 to 300 TB of textual content obtained via automatic web crawling, usually
213
+ every 1 or 2 months. </p>
214
+ <p>As an example, the latest CC crawl (id 2024-18) contains 2.7
215
+ billion web pages, totaling 386 TiB of uncompressed HTML text content (note that the size varies from crawl to crawl).
216
+ Ninety-six crawls have been released since 2013, along with 3 dumps from 2008 to 2012, which are in a different (older) format.
217
+ <d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
218
+
219
  <h3>Processing at scale</h3>
220
  <p>Given the sheer size of the data involved, one of the main
221
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
 
225
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a><d-cite bibtex-key="penedo2024datatrove"></d-cite>, an open-source data
226
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
227
  CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this <a
228
+ href="https://github.com/huggingface/datatrove">library</a>. In most cases, you'll find the exact same scripts we used in the <a
229
+ href="https://github.com/huggingface/datatrove"><code>datatrove</code></a> repository.</p>
230
+
231
+ <h3>What is good data?</h3>
232
  <p>This is probably the main question to keep in mind when
233
+ creating a dataset. In most contexts, and in particular in the context of large language model pretraining<d-footnote>Note that all our discussion in this report is focused on the specific field of web-scale datasets ("web-scale" typically meaning >100 billion tokens) used to pretrain large language models (by pretraining we mean the very first stage of training a model, starting from random weights). We don't claim to cover any other field of dataset creation, nor that the lessons or hypotheses we develop in this document extend to any field besides this specific one.</d-footnote>, "high quality" is not a very well-defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
234
+ <p>It is still common to train a model on a given corpus considered "clean"
235
+ (typically Wikipedia<d-footnote>Even though, as we mentioned above, the notion of "clean" is so ill-defined that it should probably not be seen as equivalent to Wikipedia-style text</d-footnote>) and use it to check the perplexity on the dataset
236
+ that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with improved performance on a set of downstream
237
+ tasks of interest<d-cite bibtex-key="soldaini2024dolma"></d-cite>, and as a result another commonly used approach is to train small models<d-footnote>"Small" in comparison to the standard sizes of today's LLMs, i.e. small in comparison to 7-70 billion parameters. In this work "small" means about 1-2 billion parameters.</d-footnote> on a representative subset of our dataset and evaluate them on
238
+ a set of evaluation tasks. Small models are used because training becomes increasingly
239
+ expensive and time-consuming as model size grows. In this second approach, it is important to
240
+ choose a diverse and representative set of evaluation tasks, and to try not to overfit to any individual benchmark, as that would risk hurting the generality of the LLM obtained after pretraining.</p>
241
+ <p>Yet another way to compare different datasets would be to
242
+ train a model on each dataset and have humans rate and compare the generations of the models (like on the <a
243
  href="https://chat.lmsys.org/">LMSYS Chatbot Arena</a>)<d-cite bibtex-key="chiang2024chatbot"></d-cite>. This would arguably provide the most
244
+ reliable results in terms of representing real model usage, but getting ablation results this way is unfortunately
245
+ expensive and slow. It also often requires the models to have undergone an instruction-finetuning stage to acquire conversational capabilities, as pretrained models are not directly designed to follow instructions and are thus much more sensitive to prompt details.<d-cite bibtex-key="ouyang2022training"></d-cite></p>
246
+ <p>In this work, we went with the approach of training small
247
+ models and evaluating them on a set of "early-signal" benchmark tasks. We believe this is a reasonable proxy for the quality
248
+ of the data used to train these models, with the above-mentioned caveat around overfitting on the evaluation benchmarks.</p>
249
  <h3>Ablations and evaluation setup</h3>
250
  <p>To be able to compare the impact of a given processing
251
+ step, we typically train two models on two versions of the dataset: one version processed with the extra step under consideration, and another version with this step
252
+ ablated (cut/removed). Apart from the data, these two models are otherwise identical: they have the same number of parameters and architecture hyper-parameters, and are trained
253
+ on an equal number of randomly sampled tokens from each version of the data, for a single epoch; the only difference is thus the
254
+ training data. We then evaluate each model on the same set of tasks and compare average
255
  scores.</p>
256
+ <p>Our ablation models are trained using <a
257
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
258
+ INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Ablation models have 1.82B parameters (including embeddings), use the Llama
259
+ architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT-2 tokenizer mentioned above. For most
260
+ ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
261
+ model size). To confirm relative performance after several steps of filtering, we conducted longer training runs of 350 billion tokens, as mentioned further below.</p>
262
  <p>We evaluated the models using <a
263
+ href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We carefully selected a set of benchmark for ablations by selecting
264
+ benchmarks that would provide good signal at a relatively small scale ("small" models trained on only "a few
265
+ billion" tokens). We generally used the following criteria to select these benchmarks among all the benchmarks available in <code>lighteval</code>:</p>
266
  <ul>
267
  <li>small variance between runs trained on different samplings of the same
268
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
269
+ resulting scores to have as little evaluation noise as possible, while still being sensitive to the larger data changes that we are actually ablating
270
  </li>
271
  </ul>
272
  <ul>
273
  <li>performance increasing monotonically (or close) over a training run:
274
+ ideally, as the number of seen tokens increases, the performance on a high-signal benchmark should not decrease
275
  (which would be indicative of unreliable results at a small scale)
276
  </li>
277
  </ul>
278
+ <ul>
279
+ <li>performance at least a few standard deviations above the random-guessing baseline: given our small ablation models and training budgets, we usually don't reach extremely high scores on any benchmark, but we want to make sure that the scores we get are above random noise.
280
+ </li>
281
+ </ul>
282
+ <p>After consideration, we selected the following list of benchmarks:</p>
283
  <ul>
284
  <li>CommonSense QA<d-cite bibtex-key="talmor-etal-2019-commonsenseqa"></d-cite></li>
285
  <li>HellaSwag<d-cite bibtex-key="zellers-etal-2019-hellaswag"></d-cite></li>
 
291
  <li>MMLU<d-cite bibtex-key="hendrycks2021measuring"></d-cite></li>
292
  </ul>
293
  <p>To
294
+ compute checkpoint evaluations within a constrained time budget, we capped the longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
295
  min on a single node of 8 GPUs - done in parallel to the training).</p>
296
  <aside>You can find the full list of tasks and prompts we used <a
297
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>