Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 66
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 27
Article Releasing the largest multilingual open pretraining dataset By Pclanglais • 3 days ago • 85
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists Paper • 2410.23331 • Published 17 days ago • 7
SmolLM2 Collection State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M • 8 items • Updated 12 days ago • 167
Granite 3.0 Language Models Collection A series of language models trained by IBM and licensed under the Apache 2.0 license. We release both the base pretrained and instruct models. • 8 items • Updated 12 days ago • 87
Article Releasing Outlines-core 0.1.0: structured generation in Rust and Python 26 days ago • 41
Article ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models By ahmed-masry • 30 days ago • 15
Article OCR Processing and Text in Image Analysis with DeepSeek Janus-1.3B By PandorAI1995 • 25 days ago • 2
Article OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B By PandorAI1995 • 29 days ago • 13
Article 🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦⬛ By anakin87 • 26 days ago • 18
Article How to build a custom text classifier without days of human labeling By sdiazlor • about 1 month ago • 55