![HuggingFaceFW's profile picture](https://cdn-avatars.huggingface.co/v1/production/uploads/62596f9e1c0a084224b93e00/EfmW5LH_nj0FCEZH7wH2p.png)
HuggingFaceFW
🤗 HuggingFace 🍷 FineWeb datasets
Read our technical report!
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).
Creating 🍷 FineWeb involved carefully processing and filtering large amounts of web data, with the aim of lowering the barrier to entry for anyone who wants to pretrain high-performance large language models.
All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries datatrove, nanotron or lighteval.
Version 1 of the 🍷 FineWeb dataset is available here. Our ablation models can be found here.
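As a rough illustration of how one might peek at the data, the sketch below streams a few documents instead of downloading the full dataset. It assumes the 🤗 `datasets` library is installed, that the repository `HuggingFaceFW/fineweb` exposes a sample configuration named `sample-10BT`, and that each row has a `text` field; the helper names `take_texts` and `stream_fineweb` are hypothetical and not part of any library.

```python
from itertools import islice
from typing import Iterable, List


def take_texts(rows: Iterable[dict], n: int, field: str = "text") -> List[str]:
    """Collect `field` from the first `n` rows of any iterable of dicts."""
    return [row[field] for row in islice(rows, n)]


def stream_fineweb(config: str = "sample-10BT"):
    """Open the dataset in streaming mode (needs network access).

    Assumptions: the config name "sample-10BT" and the "train" split exist
    in the HuggingFaceFW/fineweb repository.
    """
    # Imported lazily so take_texts can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb",
        name=config,
        split="train",
        streaming=True,  # iterate lazily instead of downloading everything
    )
```

With network access, `take_texts(stream_fineweb(), 3)` would return the text of the first three streamed documents. Streaming is used here because the full dataset is far too large to download casually.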
Models (29)
HuggingFaceFW/ablation-model-fineweb-edu • Text Generation • 3.38k • 4
HuggingFaceFW/fineweb-edu-classifier • Text Classification • 1.37M • 70
HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT • Text Generation
HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT • Text Generation
HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT • Text Generation
HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT • Text Generation
HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT • Text Generation
HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT • Text Generation
HuggingFaceFW/ablation-exp-filter-c4-word_lengths-28BT • Text Generation
HuggingFaceFW/ablation-exp-filter-c4-tpunct-28BT • Text Generation
Datasets (5)
HuggingFaceFW/fineweb • 46B • 31.9k • 1.5k
HuggingFaceFW/fineweb-edu • 3B • 217k • 340
HuggingFaceFW/fineweb-edu-llama3-annotations • 467k • 650 • 22
HuggingFaceFW/fineweb-edu-score-2 • 11.8B • 4.24k • 32
HuggingFaceFW/admin • 2