1675 661 1638

Julien Chaumond PRO

julien-c

https://huggingface.co

AI & ML interests

<3 ML/AI for everyone, building products to propel communities fwd

Recent Activity

reacted to jsulz's post with ❤️ about 2 hours ago

reacted to jsulz's post with 🔥 about 2 hours ago

reacted to jsulz's post with 🚀 about 18 hours ago

Articles

Hugging Face Selected for the French Data Protection Agency Enhanced Support Program

May 15, 2023

How to train a new language model from scratch using Transformers and Tokenizers

Feb 14, 2020

• 21

Organizations

julien-c's activity

reacted to jsulz's post with ❤️🔥 about 2 hours ago

Post

779

When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks

In our benchmarks, we found that using CDC to store iterative model and dataset version led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning on our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks

reacted to jsulz's post with 🚀 about 18 hours ago

Post

1905

In August, the XetHub team joined Hugging Face
- https://huggingface.co/blog/xethub-joins-hf - and we’ve been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.

Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.

You can read more about the findings (with some jaw-dropping stats + charts) here https://www.linkedin.com/feed/update/urn:li:activity:7244486280351285248

replied to PLB's post 2 days ago

cool!

reacted to AdinaY's post with 🔥 6 days ago

Post

2478

Let’s dive into the exciting releases from the Chinese community last week 🔥🚀
More details 👉 https://huggingface.co/zh-ai-community

Code model:
✨Qwen 2.5 coder by Alibaba Qwen
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f
✨OpenCoder by InflyAI - Fully open code model🙌
infly/opencoder-672cec44bbb86c39910fb55e

Image model:
✨Hunyuan3D-1.0 by Tencent
tencent/Hunyuan3D-1

MLLM:
✨JanusFlow by DeepSeek
deepseek-ai/JanusFlow-1.3B
deepseek-ai/JanusFlow-1.3B
✨Mono-InternVL-2B by OpenGVlab
OpenGVLab/Mono-InternVL-2B

Video model:
✨CogVideoX 1.5 by ChatGLM
THUDM/CogVideoX1.5-5B-SAT

Audio model:
✨Fish Agent by FishAudio
fishaudio/fish-agent-v0.1-3b

Dataset:
✨OPI dataset by BAAIBeijing
BAAI/OPI

replied to PLB's post 7 days ago

wow this is super cool!
Do you have a page where we can see it live?

oh ok i guess on https://phospho.ai/

reacted to PLB's post with 🚀 7 days ago

Post

1821

⚠️ People selling AI chatbots for websites hate us.
Add an open source chat assistant on your website in 5 minutes: https://github.com/phospho-app/ai-chat-bubble

How does it work ?
- You give an URL
- The AI assistant crawls the website content and embed it
- Add it to your frontend in one line of code
- People on your website can ask the assistant questions

Powered by BAAI/bge-small-en-v1.5 and Mistral AI

5 replies

replied to reach-vb's post 16 days ago

cool recap

reacted to louisbrulenaudet's post with ❤️ 16 days ago

Post

2080

My biggest release of the year: a series of 7 specialized embedding models for information retrieval within tax documents, is now available for free on Hugging Face 🤗

These new models aim to offer an open source alternative for in-domain semantic search from large text corpora and will improve RAG systems and context addition for large language models.

Trained on more than 43 million tax tokens derived from semi-synthetic and raw-synthetic data, enriched by various methods (in particular MSFT's evol-instruct by @intfloat ), and corrected by humans, this project is the fruit of hundreds of hours of work and is the culmination of a global effort to open up legal technologies that has only just begun.

A big thank you to Microsoft for Startups for giving me access to state-of-the-art infrastructure to train these models, and to @julien-c , @clem 🤗, @thomwolf and the whole HF team for the inference endpoint API and the generous provision of Meta LLama-3.1-70B. Special thanks also to @tomaarsen for his invaluable advice on training embedding models and Loss functions ❤️

Models are available on my personal HF page, into the Lemone-embed collection: louisbrulenaudet/lemone-embed-66fdc24000df732b395df29b

1 reply

replied to nroggendorff's post 19 days ago

will fix it thanks for the report:)

reacted to clem's post with 🤗❤️🚀🔥 27 days ago

Post

4288

This is no Woodstock AI but will be fun nonetheless haha. I’ll be hosting a live workshop with team members next week about the Enterprise Hugging Face hub.

1,000 spots available first-come first serve with some surprises during the stream!

You can register and add to your calendar here: https://streamyard.com/watch/JS2jHsUP3NDM

4 replies

replied to thomwolf's post 27 days ago

hehe

reacted to thomwolf's post with 🤗🚀❤️ 27 days ago

Post

4031

Parents in the 1990: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨

2 replies

reacted to louisbrulenaudet's post with 🔥 about 1 month ago

Post

3092

🚨 I have $3,500 in Azure credits, including access to an H100 (96 Go), expiring on November 12, 2024.

I won’t be able to use it all myself, so I’m reaching out to the @huggingface community: Are there any open-source projets with data ready for some compute power?

Let’s collaborate and make the most of it together 🔗