data - a leonardlin Collection

data

updated Aug 17

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Paper • 2305.13169 • Published May 22, 2023 • 3

A Survey on Data Selection for Language Models

Paper • 2402.16827 • Published Feb 26 • 4

HuggingFaceFW/fineweb-edu

Viewer • Updated Oct 11 • 3B • 624k • 543

allenai/MADLAD-400

Updated Sep 9 • 83.8k • 128

uonlp/CulturaX

Viewer • Updated Jul 23 • 7.18B • 10.6k • 475

Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28 • 95

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Paper • 2407.16154 • Published Jul 23 • 21

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Paper • 2408.02085 • Published Aug 4 • 17

Better Alignment with Instruction Back-and-Forth Translation

Paper • 2408.04614 • Published Aug 8 • 14

The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community

Paper • 2408.08291 • Published Aug 15 • 10