Guilherme Penedo
guipenedo
guipenedo's activity
MMLU evaluation setting
1
#44 opened 4 months ago
by
jordane95
Delete data/CC-MAIN-2013-20/train-00001-of-00014.parquet
#16 opened 7 days ago
by
KomX
350BT sample is much smaller than advertised
1
#53 opened 3 months ago
by
DavidNemeskey
Exact copy of this dataset on HuggingFace yields "This dataset has 218 files that have been marked as unsafe."
1
#50 opened 3 months ago
by
egor-pakhomov
Update README.md
1
#51 opened 3 months ago
by
CryptoUranus
Simple exact deduplication removes 2/3 of data.
3
#49 opened 4 months ago
by
egor-pakhomov
Update README.md
1
#12 opened 4 months ago
by
Jhonwik
Casting Issue?
4
#40 opened 5 months ago
by
FelixLabelle
Resources on DataTrove ?
1
#10 opened 4 months ago
by
alielfilali01
Is there an official test set for benchmarking objectively?
1
#42 opened 5 months ago
by
SophieOstmeier
Dataset Viewer issue: JobManagerCrashedError
1
#37 opened 5 months ago
by
nudelbrot
How to compute the aggregate score?
1
#35 opened 5 months ago
by
mornmirror
fixing typo
#34 opened 6 months ago
by
eliebak
Any plans to release warc content after the language filtering steps?
1
#41 opened 5 months ago
by
Splend1dchan
FineWeb train configuration
2
#39 opened 5 months ago
by
nezhazheng
Reproducibility of the work for other languages
2
#38 opened 5 months ago
by
camillop
Intermediate checkpoints
1
#1 opened 6 months ago
by
przvl
Language subset
1
#33 opened 6 months ago
by
talmor
Reconstructing sample versions
2
#31 opened 6 months ago
by
maveriq