guipenedo HF staff committed on
Commit
f2ea008
•
1 Parent(s): 6ab6d43

Update README.md

Files changed (1)
  1. README.md +7 -3
README.md CHANGED
@@ -7,8 +7,12 @@ sdk: static
7
  pinned: false
8
  ---
9
 
10
- # HuggingFace FineWeb datasets
 
11
 
12
- This is the home of the FineWeb datasets, a series of web (CommonCrawl) text datasets filtered and processed for pretraining highly performant large language models.
13
 
14
- Currently releasing v1
 
 
 
 
7
  pinned: false
8
  ---
9
 
10
+ # 🤗 HuggingFace 🍷 FineWeb datasets
11
+ This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web ([CommonCrawl](https://commoncrawl.org/)), released under a permissive license ([ODC-By](https://opendatacommons.org/licenses/by/1-0/)).
12
 
13
+ The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data with the aim of lowering the barriers to entry to anyone intending to pretrain high-performance large language models.
14
 
15
+ All code and artefacts needed for reproduction are public and built on top of open source libraries, like the 🤗 libraries [`datatrove`](https://github.com/huggingface/datatrove/), [`nanotron`](https://github.com/huggingface/nanotron/) or [`lighteval`](https://github.com/huggingface/lighteval/).
16
+
17
+
18
+ _Currently releasing v1_