Manual Configuration Collection 5 datasets showcase YAML configuration on HuggingFace. See docs: https://huggingface.co/docs/hub/datasets-manual-configuration. • 5 items • Updated Nov 23, 2023 • 3
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25 • 84
🍃 MINT-1T Collection Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24 • 49
view article Article Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing By Pclanglais • Jul 19 • 17
view article Article Experimenting with Automatic PII Detection on the Hub using Presidio Jul 10 • 23
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata Paper • 2405.09496 • Published May 15 • 3
Leaderboards and benchmarks ✨ Collection Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ... • 67 items • Updated Aug 6 • 84
view article Article Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent Apr 22 • 78
view article Article ⚗️ 🧑🏼🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 28
view article Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18 • 21
view article Article Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B Apr 4 • 23