Guilherme Penedo

guipenedo

guipenedo's activity

New activity in HuggingFaceFW/fineweb 2 days ago

MMLU evaluation setting

#44 opened 4 months ago by

New activity in HuggingFaceFW/fineweb-edu 6 days ago

Delete data/CC-MAIN-2013-20/train-00001-of-00014.parquet

#16 opened 7 days ago by

New activity in HuggingFaceFW/fineweb about 2 months ago

350BT sample is much smaller than advertized

#53 opened 3 months ago by

New activity in HuggingFaceFW/fineweb 3 months ago

Exact copy of this dataset on HuggingFace yields "This dataset has 218 files that have been marked as unsafe."

#50 opened 3 months ago by

Update README.md

#51 opened 3 months ago by

Simple exact deduplication removes 2/3 of data.

#49 opened 4 months ago by

New activity in HuggingFaceFW/fineweb-edu 3 months ago

Update README.md

#12 opened 4 months ago by

New activity in HuggingFaceFW/fineweb 4 months ago

Casting Issue?

#40 opened 5 months ago by

New activity in HuggingFaceFW/fineweb-edu 4 months ago

Resources on DataTrove ?

#10 opened 4 months ago by

New activity in HuggingFaceFW/fineweb 4 months ago

Is there an official test set for benchmarking objectively?

#42 opened 5 months ago by

Dataset Viewer issue: JobManagerCrashedError

#37 opened 5 months ago by

How to compute the aggregate score?

#35 opened 5 months ago by

fixing typo

#34 opened 6 months ago by

New activity in HuggingFaceFW/fineweb 5 months ago

Any plans to release warc content after the language filtering steps?

#41 opened 5 months ago by

Fineweb train configuration

#39 opened 5 months ago by

Reproducibility of the work for other languages

#38 opened 5 months ago by

New activity in HuggingFaceFW/ablation-model-fineweb-edu 5 months ago

Intermediate checkpoints

#1 opened 6 months ago by

New activity in HuggingFaceFW/fineweb 6 months ago

Language subset

#33 opened 6 months ago by

Dedup

#32 opened 6 months ago by

Reconstructing sample versions

#31 opened 6 months ago by