Guilherme Penedo
guipenedo
guipenedo's activity
350BT sample is much smaller than advertised
1
#53 opened about 1 month ago
by
DavidNemeskey
Exact copy of this dataset on HuggingFace yields "This dataset has 218 files that have been marked as unsafe."
1
#50 opened about 1 month ago
by
egor-pakhomov
Update README.md
1
#51 opened about 1 month ago
by
CryptoUranus
Simple exact deduplication removes 2/3 of data.
3
#49 opened about 2 months ago
by
egor-pakhomov
Update README.md
1
#12 opened 2 months ago
by
Jhonwik
Casting Issue?
4
#40 opened 3 months ago
by
FelixLabelle
Resources on DataTrove ?
1
#10 opened 3 months ago
by
alielfilali01
Is there an official test set for benchmarking objectively?
1
#42 opened 3 months ago
by
SophieOstmeier
Dataset Viewer issue: JobManagerCrashedError
1
#37 opened 3 months ago
by
nudelbrot
How to compute the aggregate score?
1
#35 opened 4 months ago
by
mornmirror
fixing typo
#34 opened 4 months ago
by
eliebak
Any plans to release warc content after the language filtering steps?
1
#41 opened 3 months ago
by
Splend1dchan
Fineweb train configuration
1
#39 opened 3 months ago
by
nezhazheng
Reproducibility of the work for other languages
2
#38 opened 3 months ago
by
camillop
Intermediate checkpoints
1
#1 opened 4 months ago
by
przvl
Language subset
1
#33 opened 4 months ago
by
talmor
Reconstructing sample versions
2
#31 opened 4 months ago
by
maveriq
MMLU score recreation
2
#5 opened 4 months ago
by
theblackcat102
Recreating MMLU scores
2
#2 opened 4 months ago
by
theblackcat102
Sample dataset?
8
#23 opened 5 months ago
by
dweb
Details on the evaluation with lighteval
2
#22 opened 5 months ago
by
amaracani
Split by languages?
4
#7 opened 5 months ago
by
mhenrichsen
Thank you for the great dataset
#5 opened 5 months ago
by
musicurgy
Deduped dataset across all CC dumps or within each dump?
2
#26 opened 5 months ago
by
riturajj
Code Datasets
2
#30 opened 4 months ago
by
Taylor658
Regarding the newly updated indexes (written as deduplication issues)
5
#29 opened 4 months ago
by
kimcando
Scoring documents with LLM and making scores available as a quality filter (Ask-LLM)
4
#3 opened 5 months ago
by
Lauler
Update README.md
1
#27 opened 4 months ago
by
h0kkabaz
Has this published dataset gone through the PII process too?
2
#20 opened 5 months ago
by
kimcando
FineWeb and Redpajamav2 deduplication
2
#24 opened 5 months ago
by
PereLluis13
Reproducing evaluation results
1
#3 opened 5 months ago
by
hjlee1371
Disk size of each dump
1
#15 opened 5 months ago
by
qbin
Download filtered dataset
1
#11 opened 5 months ago
by
Basma-b
Training configs for data ablation study
1
#14 opened 5 months ago
by
jimmyhbx
Intermediate checkpoints
2
#1 opened 5 months ago
by
przvl
FWPR
#13 opened 5 months ago
by
infinite85
Reprocessing for a new language
13
#12 opened 5 months ago
by
pere
add license tag so it can be submitted to open llm leaderboard
2
#2 opened 5 months ago
by
mrfakename
License
2
#1 opened 5 months ago
by
mrfakename
Compatibility with released datatrove version
2
#6 opened 5 months ago
by
stefan-it
Is it currently unsafe to download this dataset?
2
#10 opened 5 months ago
by
george-adams1
ARC-Challenge Benchmark
2
#2 opened 5 months ago
by
Taylor658
[bot] Conversion to Parquet
#1 opened 5 months ago
by
parquet-converter
fix typos
1
#2 opened 7 months ago
by
guipenedo
🚩 Report: Not working
15
#64 opened 9 months ago
by
Gertie01
🚩 Report: Not working
5
#62 opened 9 months ago
by
Gertie01
🚩 Report: Not working
5
#60 opened 10 months ago
by
Dinkum
🚩 Report: Not working
4
#57 opened 10 months ago
by
Spoon300
🚩 Report: Not working
7
#54 opened 10 months ago
by
Sksis
Update gradio to fix queue issue
#56 opened 10 months ago
by
guipenedo
🚩 Report: Not working
6
#46 opened 11 months ago
by
Dinkum
Nonsensical and indecipherable answers after a couple of questions
1
#51 opened 11 months ago
by
carlos-santos
Update tokenizer_config.json
#15 opened 11 months ago
by
Rocketknight1
🚩 Report: Not working
1
#48 opened 11 months ago
by
deleted
Add chat template
#14 opened 11 months ago
by
Rocketknight1
🚩 Report: Not working
7
#41 opened 12 months ago
by
someone2024
A long queue of over 1500 jobs
1
#39 opened 12 months ago
by
yikuan8
🚩 Report: Not working
4
#38 opened 12 months ago
by
heywhatsmyname