README / README.md

guipenedo HF staff

Update README.md

1c03947 verified 6 months ago

1.24 kB

	---
	title: README
	emoji: 👀
	colorFrom: purple
	colorTo: pink
	sdk: static
	pinned: false
	---

	# 🤗 Hugging Face 🍷 FineWeb datasets
	_Read our [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)!_

	This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web ([CommonCrawl](https://commoncrawl.org/)), released under a permissive license ([ODC-By](https://opendatacommons.org/licenses/by/1-0/)).

	Creating 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone who wants to pretrain high-performance large language models.

	All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries [`datatrove`](https://github.com/huggingface/datatrove/), [`nanotron`](https://github.com/huggingface/nanotron/), and [`lighteval`](https://github.com/huggingface/lighteval/).

	Version 1 of the 🍷 FineWeb dataset is available [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb). Our ablation models can be found [here](https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32).
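
To take a quick look at the data without downloading the full dataset, something like the sketch below may help. It assumes the 🤗 `datasets` library is installed and that the dataset exposes a `sample-10BT` subset and a `text` field, as described on the dataset card rather than in this README. `preview_texts` is a hypothetical helper written for this example, not part of any library.

```python
from itertools import islice
from typing import Iterable, Iterator


def preview_texts(rows: Iterable[dict], n: int = 3, width: int = 80) -> Iterator[str]:
    """Yield the first `width` characters of the `text` field for the first `n` rows."""
    for row in islice(rows, n):
        yield row["text"][:width]


if __name__ == "__main__":
    # Requires `pip install datasets` and network access.
    from datasets import load_dataset

    # Assumption: the "sample-10BT" subset name and the "text" column
    # come from the dataset card, not from this README.
    fineweb = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # iterate lazily instead of downloading everything
    )
    for snippet in preview_texts(fineweb, n=3):
        print(snippet)
```

Streaming mode reads rows lazily, so the preview starts printing after fetching only a small part of the data.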