---
title: README
emoji: π
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
---
# 🤗 HuggingFace 🍷 FineWeb datasets
_Read our [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)!_
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web ([CommonCrawl](https://commoncrawl.org/)), released under a permissive license ([ODC-By](https://opendatacommons.org/licenses/by/1-0/)).
Creating 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.
All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries [`datatrove`](https://github.com/huggingface/datatrove/), [`nanotron`](https://github.com/huggingface/nanotron/) and [`lighteval`](https://github.com/huggingface/lighteval/).
Version 1 of the 🍷 FineWeb dataset is available [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb). Our ablation models can be found [here](https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32).
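For a quick look at the data without downloading it in full, streaming with the 🤗 `datasets` library is one option. The snippet below is a minimal sketch, not an official recipe: the helper names `first_texts` and `stream_fineweb` are hypothetical, and the `sample-10BT` config name and the `text` column are assumptions about the dataset layout that should be checked against the dataset card.

```python
from itertools import islice


def first_texts(records, n, field="text"):
    """Return the `field` value of the first `n` records from any iterable of dicts."""
    return [record[field] for record in islice(records, n)]


def stream_fineweb(config="sample-10BT"):
    """Open a FineWeb config in streaming mode.

    Assumes `pip install datasets` and network access; the default config
    name is an assumption, see the dataset card for the available configs.
    """
    # Imported lazily so first_texts() works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb",
        name=config,
        split="train",
        streaming=True,  # iterate over records without downloading everything
    )
```

Calling `first_texts(stream_fineweb(), 3)` would then fetch the text of the first three documents from the stream.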