Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

Recent Activity

posted an update about 22 hours ago
Made a new app to visualize the LLM race ⇒ 𝗡𝗼 𝗘𝘂𝗿𝗼𝗽𝗲𝗮𝗻 𝗰𝗼𝗺𝗽𝗮𝗻𝘆 𝗶𝗻 𝘁𝗵𝗲 𝘁𝗼𝗽 𝟭𝟬 🇪🇺❌

See the app here 👉 m-ric/llm-race-to-the-top

I've adapted an app by @andrewrreed that tracks the progress of LLMs on the Chatbot Arena leaderboard (andrewrreed/closed-vs-open-arena-elo) to compare companies from different countries.

The outcome is quite sad, as a Frenchman and European.

The top 10 is exclusively US 🇺🇸 and Chinese 🇨🇳 companies (after great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.

American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up 😬

⚠️ Caution: This Chatbot Arena Elo ranking is not the most accurate, especially at high scores like these, because LLM makers can game it to some extent.
upvoted an article 1 day ago
New activity in AtlaAI/judge-arena 1 day ago
posted an update 2 days ago
𝗡𝗲𝘄 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 𝗿𝗮𝗻𝗸𝘀 𝗟𝗟𝗠𝘀 𝗳𝗼𝗿 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲: 𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭-𝟳𝟬𝗕 𝘁𝗼𝗽𝘀 𝘁𝗵𝗲 𝗿𝗮𝗻𝗸𝗶𝗻𝗴𝘀! 🧑‍⚖️

Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique to do it.

First, what is "LLM-as-a-judge"?
👉 It's a very useful technique for evaluating LLM outputs. If what you're evaluating can't be judged with deterministic criteria, like the "politeness" of an LLM output or how faithful it is to an original source, you can use an LLM judge instead: prompt another LLM with "Here's an LLM output, please rate this on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilà, you get your score.
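The prompt-then-parse loop above can be sketched in a few lines of Python. Here `call_llm` is a hypothetical stand-in for whatever client you use to query the judge model; the prompt template is taken from the description above:

```python
import re

# Prompt template from the description above; {criterion} and {output}
# are filled in per evaluation.
JUDGE_PROMPT = (
    "Here's an LLM output, please rate this on criterion {criterion} "
    "on a scale of 1 to 5.\n\nOutput:\n{output}"
)

def parse_score(judge_reply):
    """Return the first 1-5 rating found in the judge's reply, else None."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None

def judge(output, criterion, call_llm):
    """Score `output` on `criterion`; `call_llm` is a hypothetical function
    that sends a prompt to your judge LLM and returns its text reply."""
    reply = call_llm(JUDGE_PROMPT.format(criterion=criterion, output=output))
    return parse_score(reply)
```

Returning None when no rating is found lets you detect (and re-prompt) the cases where the judge didn't answer in the expected format.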

🧐 But who judges the judge?
How can you make sure your LLM-judge is reliable? You can have a specific dataset annotated with scores provided by human judges, and compare how LLM-judge scores correlate with human judge scores.

📊 Before even running that benchmark, there's a new option to get you started: a leaderboard that measures how well different models perform as judges!

And the outcome is surprising: the models rank quite differently from what we're used to in general leaderboards. Probably some have much better bias mitigation than others!

Take a deeper look here 👉 https://huggingface.co/blog/arena-atla
liked a Space 2 days ago
posted an update 2 days ago
Lifehack of the day:
Adding "r.jina.ai/" before any URL transforms it into Markdown using Jina AI's Reader! Here with @cyrilzakka's blog post.
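The whole trick is URL prefixing, which can be sketched like so (the example URL is a placeholder; fetching the prefixed URL with any HTTP client returns the page as Markdown):

```python
def to_reader_url(url):
    """Prefix any URL with Jina AI's Reader endpoint to get it back as Markdown."""
    return "https://r.jina.ai/" + url

markdown_url = to_reader_url("https://example.com/blog-post")
# Fetching markdown_url (e.g. with urllib.request.urlopen) returns the
# page content converted to Markdown.
print(markdown_url)
```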
Reacted to cfahlgren1's post with ❤️ 3 days ago
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into list of structs
- Removing empty system prompts
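The post does these steps in SQL in the browser; as an illustrative stand-in, the same three steps look like this in plain Python on a toy list-of-dicts dataset (the split names and message format here are simplified assumptions, not the real schema):

```python
import json

# Toy stand-in for the dataset's separate splits
splits = {
    "creative_writing": [
        {"messages": json.dumps([
            {"role": "system", "content": ""},
            {"role": "user", "content": "Write a haiku."},
        ])},
    ],
    "code": [
        {"messages": json.dumps([
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": "Reverse a list in Python."},
        ])},
    ],
}

# 1. Join the splits together and add a `split` column
rows = [dict(row, split=name) for name, items in splits.items() for row in items]

# 2. Convert string messages into lists of structs (dicts)
for row in rows:
    row["messages"] = json.loads(row["messages"])

# 3. Remove empty system prompts
for row in rows:
    row["messages"] = [
        m for m in row["messages"]
        if not (m["role"] == "system" and not m["content"].strip())
    ]
```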

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
posted an update 3 days ago
🔍 Meta teams use a fine-tuned Llama model to fix production issues in seconds

One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.

🤔 42%, is that not too low?
➡️ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
💡 So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!

How did they do it?

🔄 Two-step approach:
‣ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
‣ Fine-tuned Llama 2 7B ranks the most likely culprits
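A very rough sketch of that two-step funnel (all names and heuristics here are illustrative guesses, not Meta's actual code; `llm_score` is a hypothetical stand-in for the fine-tuned ranking model):

```python
def heuristic_filter(changes, incident, max_candidates=20):
    """Step 1: cheap signals (here, just code ownership) cut thousands of
    changes down to a manageable candidate set."""
    candidates = [
        change for change in changes
        if incident["service"] in change["owned_services"]
    ]
    return candidates[:max_candidates]

def rank_candidates(candidates, incident, llm_score):
    """Step 2: a fine-tuned model scores each candidate as the likely root
    cause; highest-scored candidates are surfaced first."""
    return sorted(candidates, key=lambda c: llm_score(c, incident), reverse=True)
```

The point of the split is cost: the cheap heuristics run over everything, while the expensive LLM only ever sees the 2-20 survivors, matching the constraint the training data mimicked.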

🎓 Training pipeline:
‣ Continued pre-training on Meta's internal docs and wikis
‣ Supervised fine-tuning on past incident investigations
‣ Training data mimicked real-world constraints (2-20 potential changes per incident)

🔮 Future developments to watch:
‣ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
‣ Improvements in model reasoning should boost accuracy further

Read it in full 👉 https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
upvoted an article 3 days ago
liked a Space 4 days ago
Reacted to reach-vb's post with 🔥 4 days ago
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) code generation LLMs, with the 32B tackling giants like Gemini 1.5 Pro and Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B / 8B models approaching GPT-4o level; pick any LLM and train an adapter with Whisper as the audio encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3B by DeepSeek - Next iteration of their unified multimodal LLM Janus, now with Rectified Flow
deepseek-ai/JanusFlow-1.3B

Common Corpus by PleIAs - 2,003,039,184,047 multilingual, commercially permissive and high-quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! 🤗
upvoted an article 4 days ago
posted an update 4 days ago
Great feature alert: 𝗬𝗼𝘂 𝗰𝗮𝗻 𝗻𝗼𝘄 𝘂𝘀𝗲 𝗮𝗻𝘆 𝗦𝗽𝗮𝗰𝗲 𝗮𝘀 𝗮 𝘁𝗼𝗼𝗹 𝗳𝗼𝗿 𝘆𝗼𝘂𝗿 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀.𝗮𝗴𝗲𝗻𝘁! 🛠️🔥🔥

This lets you take the coolest spaces, like FLUX.1-dev, and use them in agentic workflows with a few lines of code! 🧑‍💻
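Following the linked docs, the setup looks roughly like this — a sketch assuming a recent `transformers` with the agents API; the Space id and tool description follow the FLUX.1-dev example, and running it requires a Hugging Face token and network access:

```python
from transformers import CodeAgent, HfApiEngine, Tool

# Import any Space as a tool (here, the FLUX.1-dev image generator)
image_generation_tool = Tool.from_space(
    "black-forest-labs/FLUX.1-dev",
    name="image_generator",
    description="Generate an image from a prompt",
)

# Use it in an agentic workflow
agent = CodeAgent(tools=[image_generation_tool], llm_engine=HfApiEngine())
agent.run("Make me a picture of me surfing a huge wave.")
```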

In the video below, I set up my fake vacation pictures where I'm awesome at surfing (I'm really not) 🏄

Head to the doc to learn this magic 👉 https://huggingface.co/docs/transformers/main/en/agents_advanced#import-a-space-as-a-tool-