Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

AI & ML interests

Machine Learning Librarian

Articles

Share your open ML datasets on Hugging Face Hub!

Scaling AI-based Data Processing with Hugging Face + Dask

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

Data Is Better Together: A Look Back and Forward

Synthetic dataset generation techniques: generating custom sentence similarity data

Synthetic dataset generation techniques: Self-Instruct

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Introducing BERTopic Integration with Hugging Face Hub

Jupyter X Hugging Face

Image search with 🤗 datasets

davanstrien's activity

upvoted an article 3 days ago

Article

Releasing the largest multilingual open pretraining dataset

3 days ago

• 85

upvoted a collection 7 days ago

Dataset Exploration

3 items • Updated 6 days ago • 1

upvoted an article 7 days ago

Article

Inference Endpoints Changelog 🚀

Oct 11

• 18

upvoted a paper 15 days ago

Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Paper • 2410.23331 • Published 17 days ago • 7

upvoted a collection 16 days ago

SmolLM2

State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M • 8 items • Updated 12 days ago • 167

upvoted a paper 17 days ago

Florence: A New Foundation Model for Computer Vision

Paper • 2111.11432 • Published Nov 22, 2021 • 3

upvoted a collection 18 days ago

Llama-3.2

5 items • Updated 30 days ago • 1

upvoted a collection 24 days ago

Granite 3.0 Language Models

A series of language models trained by IBM licensed under Apache 2.0 license. We release both the base pretrained and instruct models. • 8 items • Updated 12 days ago • 87

upvoted 7 articles 25 days ago

Article

Releasing Outlines-core 0.1.0: structured generation in Rust and Python

26 days ago

• 41

Article

ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models

30 days ago

• 15

Article

OCR Processing and Text in Image Analysis with DeepSeek Janus-1.3B

25 days ago

• 2

Article

OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B

29 days ago

• 13

Article

🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦‍⬛

26 days ago

• 18

Article

Aria: First Open Multimodal Native MoE Model

26 days ago

• 6

Article

Allegro: Advanced Video Generation Model

26 days ago

• 55

upvoted an article about 1 month ago

Article

How to build a custom text classifier without days of human labeling

about 1 month ago

• 55

upvoted a paper about 1 month ago

Aria: An Open Multimodal Native Mixture-of-Experts Model

Paper • 2410.05993 • Published Oct 8 • 107

upvoted 3 articles about 1 month ago

Article

Improving Parquet Dedupe on Hugging Face Hub

Oct 5

• 29

Article

Faster Assisted Generation with Dynamic Speculation

Oct 8

• 31

Article

Scaling AI-based Data Processing with Hugging Face + Dask

Oct 9

• 23