Resources for the Cosmopedia dataset
Hugging Face TB Research
AI & ML interests
Exploring synthetic datasets generated by large language models (TB stands for Textbook, inspired by the "Textbooks Are All You Need" paper)
Organization Card
HuggingFaceTB
This is the home of synthetic datasets for pre-training, such as Cosmopedia. We aim to scale synthetic data generation by curating diverse prompts that cover a wide range of topics and by running generation efficiently on GPUs with tools like llm-swarm.
We recently released:
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
- Cosmo-1B: a 1B-parameter model trained on Cosmopedia.
For more details, check our blog post: https://huggingface.co/blog/cosmopedia
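The prompt-curation idea described above can be sketched roughly as follows: cross a set of seed topics with target audiences and output styles so that each generation request is distinct. The topic, audience, and style lists here are illustrative placeholders, not the actual seed data used for Cosmopedia (which draws on curated sources such as web samples and educational material).

```python
from itertools import product

# Illustrative seed lists -- toy data, not the real Cosmopedia seeds.
TOPICS = ["photosynthesis", "binary search", "the water cycle"]
AUDIENCES = ["young children", "high school students", "college students"]
STYLES = ["textbook chapter", "blog post", "WikiHow-style article"]

# A single template; varying its slots is what diversifies the prompts.
TEMPLATE = (
    "Write a {style} about {topic} for {audience}. "
    "Be thorough and self-contained."
)

def build_prompts(topics, audiences, styles):
    """Cross every topic with every audience and style to diversify prompts."""
    return [
        TEMPLATE.format(style=s, topic=t, audience=a)
        for t, a, s in product(topics, audiences, styles)
    ]

prompts = build_prompts(TOPICS, AUDIENCES, STYLES)
print(len(prompts))  # 3 * 3 * 3 = 27 distinct prompts
```

Each resulting prompt would then be sent to the generator model (Mixtral-8x7B-Instruct-v0.1 in Cosmopedia's case), with the topic/audience/style combination recorded as metadata alongside the generated text.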
Collections: 1 · Spaces: 1 · Datasets: 20

Datasets (row counts shown where available):
- HuggingFaceTB/sample_log_probs (20k rows)
- HuggingFaceTB/cosmopedia_stanford_openstax_wiki_1k (3k rows)
- HuggingFaceTB/cosmopedia_web_textbooks_all_2B
- HuggingFaceTB/cosmopedia_2B_annotated_edu_score (2.69M rows)
- HuggingFaceTB/cosmopedia (31.1M rows)
- HuggingFaceTB/wiki_applied_sciences_college_students_1k (1k rows)
- HuggingFaceTB/wiki_natural_sciences_college_high_school_students_1k (1k rows)
- HuggingFaceTB/images (13 rows)
- HuggingFaceTB/bisac-topics (5.5k rows)
- HuggingFaceTB/web_under_line_mean_100 (1.16k rows)