stereoplegic's Collections
Dataset curation
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning (arXiv:2308.12032)
Know thy corpus! Robust methods for digital curation of Web corpora (arXiv:2003.06389)
Self-Alignment with Instruction Backtranslation (arXiv:2308.06259)
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation (arXiv:2305.06156)
End-to-end Knowledge Retrieval with Multi-modal Queries (arXiv:2306.00424)
SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning (arXiv:2305.15486)
Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression (arXiv:2306.15063)
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework (arXiv:2111.04130)
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models (arXiv:2311.12537)
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models (arXiv:2403.00231)
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arXiv:2402.03300)
Automated Data Curation for Robust Language Model Fine-Tuning (arXiv:2403.12776)
Better Synthetic Data by Retrieving and Transforming Existing Datasets (arXiv:2404.14361)
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach (arXiv:2405.15613)
SemCoder: Training Code Language Models with Comprehensive Semantics (arXiv:2406.01006)
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages (arXiv:2305.12182)