Resources for the Cosmopedia dataset
Hugging Face TB Research
AI & ML interests
Exploring synthetic datasets generated by large language models (TB stands for Textbook, inspired by the "Textbooks Are All You Need" paper)
Organization Card
HuggingFaceTB
This is the home of synthetic datasets for pre-training, such as Cosmopedia. We aim to scale synthetic data generation by curating diverse prompts that cover a wide range of topics and by running generation efficiently on GPUs with tools like llm-swarm.
We recently released:
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
- Cosmo-1B: a 1B-parameter model trained on Cosmopedia.
For more details, check our blog post: https://huggingface.co/blog/cosmopedia
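The prompt-curation idea described above can be sketched roughly as follows: cross a set of seed topics with target audiences and output styles so that each generation request is distinct. The topic, audience, and style lists here are illustrative placeholders, not the actual seed data used for Cosmopedia (which draws on curated sources such as web samples and educational material).

```python
from itertools import product

# Illustrative seed lists -- toy data, not the real Cosmopedia seeds.
TOPICS = ["photosynthesis", "binary search", "the water cycle"]
AUDIENCES = ["young children", "high school students", "college students"]
STYLES = ["textbook chapter", "blog post", "WikiHow-style article"]

# A single template; varying its slots is what diversifies the prompts.
TEMPLATE = (
    "Write a {style} about {topic} for {audience}. "
    "Be thorough and self-contained."
)

def build_prompts(topics, audiences, styles):
    """Cross every topic with every audience and style to diversify prompts."""
    return [
        TEMPLATE.format(style=s, topic=t, audience=a)
        for t, a, s in product(topics, audiences, styles)
    ]

prompts = build_prompts(TOPICS, AUDIENCES, STYLES)
print(len(prompts))  # 3 * 3 * 3 = 27 distinct prompts
```

Each resulting prompt would then be sent to the generator model (Mixtral-8x7B-Instruct-v0.1 in Cosmopedia's case), with the topic/audience/style combination recorded as metadata alongside the generated text.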
Collections: 1 · Spaces: 1 · Datasets: 20

Datasets (row counts shown where available):
- HuggingFaceTB/sample_log_probs (20k rows)
- HuggingFaceTB/cosmopedia_stanford_openstax_wiki_1k (3k rows)
- HuggingFaceTB/cosmopedia_web_textbooks_all_2B
- HuggingFaceTB/cosmopedia_2B_annotated_edu_score (2.69M rows)
- HuggingFaceTB/cosmopedia (31.1M rows)
- HuggingFaceTB/wiki_applied_sciences_college_students_1k (1k rows)
- HuggingFaceTB/wiki_natural_sciences_college_high_school_students_1k (1k rows)
- HuggingFaceTB/images (13 rows)
- HuggingFaceTB/bisac-topics (5.5k rows)
- HuggingFaceTB/web_under_line_mean_100 (1.16k rows)