Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

Recent Activity

posted an update about 22 hours ago
Made a new app to visualize the LLM race ⇒ 𝗡𝗼 𝗘𝘂𝗿𝗼𝗽𝗲𝗮𝗻 𝗰𝗼𝗺𝗽𝗮𝗻𝘆 𝗶𝗻 𝘁𝗵𝗲 𝘁𝗼𝗽 𝟭𝟬 🇪🇺❌

See the app here 👉 m-ric/llm-race-to-the-top

I've adapted an app by @andrewrreed that tracks the progress of LLMs on the Chatbot Arena leaderboard (andrewrreed/closed-vs-open-arena-elo) to compare companies from different countries.

The outcome is quite sad, as a Frenchman and European.

The top 10 is exclusively US 🇺🇸 and Chinese 🇨🇳 companies (after great Chinese LLM releases recently, like the Qwen2.5 series), with the notable exception of Mistral AI 🇫🇷.

American companies are making fast progress, Chinese ones even faster. Europe is at risk of being left behind. And the EU AI Act hasn't even come into force yet to slow down the EU market. We need to wake up 😬

⚠️ Caution: This Chatbot Arena Elo ranking is not the most accurate, especially at high scores like these, because LLM makers can game it to some extent.
upvoted an article 1 day ago
New activity in AtlaAI/judge-arena 1 day ago
posted an update 2 days ago
𝗡𝗲𝘄 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 𝗿𝗮𝗻𝗸𝘀 𝗟𝗟𝗠𝘀 𝗳𝗼𝗿 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲: 𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭-𝟳𝟬𝗕 𝘁𝗼𝗽𝘀 𝘁𝗵𝗲 𝗿𝗮𝗻𝗸𝗶𝗻𝗴𝘀! 🧑‍⚖️

Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique to do it.

First, what is "LLM-as-a-judge"?
👉 It's a very useful technique for evaluating LLM outputs. If what you're evaluating can't be judged with deterministic criteria, like the "politeness" of an LLM output or how faithful it is to an original source, you can use an LLM judge instead: prompt another LLM with "Here's an LLM output, please rate this on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilà, you get your score.
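The prompt-then-parse loop above can be sketched in a few lines of Python. Here `call_llm` is a hypothetical stand-in for whatever client you use to query the judge model; the prompt template is taken from the description above:

```python
import re

# Prompt template from the description above; {criterion} and {output}
# are filled in per evaluation.
JUDGE_PROMPT = (
    "Here's an LLM output, please rate this on criterion {criterion} "
    "on a scale of 1 to 5.\n\nOutput:\n{output}"
)

def parse_score(judge_reply):
    """Return the first 1-5 rating found in the judge's reply, else None."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None

def judge(output, criterion, call_llm):
    """Score `output` on `criterion`; `call_llm` is a hypothetical function
    that sends a prompt to your judge LLM and returns its text reply."""
    reply = call_llm(JUDGE_PROMPT.format(criterion=criterion, output=output))
    return parse_score(reply)
```

Returning None when no rating is found lets you detect (and re-prompt) the cases where the judge didn't answer in the expected format.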

🧐 But who judges the judge?
How can you make sure your LLM-judge is reliable? You can have a specific dataset annotated with scores provided by human judges, and compare how LLM-judge scores correlate with human judge scores.

📊 Before even running that benchmark, there's a new option to get you started: a leaderboard that measures how well different models perform as judges!

And the outcome is surprising: the models rank quite differently from what we're used to in general leaderboards. Probably some have much better bias mitigation than others!

Take a deeper look here 👉 https://huggingface.co/blog/arena-atla
liked a Space 2 days ago
posted an update 2 days ago
Lifehack of the day:
Adding "r.jina.ai/" before any URL transforms it into Markdown using Jina AI's Reader! Here with @cyrilzakka's blog post.
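The whole trick is URL prefixing, which can be sketched like so (the example URL is a placeholder; fetching the prefixed URL with any HTTP client returns the page as Markdown):

```python
def to_reader_url(url):
    """Prefix any URL with Jina AI's Reader endpoint to get it back as Markdown."""
    return "https://r.jina.ai/" + url

markdown_url = to_reader_url("https://example.com/blog-post")
# Fetching markdown_url (e.g. with urllib.request.urlopen) returns the
# page content converted to Markdown.
print(markdown_url)
```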
Reacted to cfahlgren1's post with ❤️ 3 days ago
You can clean and format datasets entirely in the browser with a few lines of SQL.

In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset.

The cleaning process consists of:
- Joining the separate splits together and adding a split column
- Converting string messages into list of structs
- Removing empty system prompts
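The post does these steps in SQL in the browser; as an illustrative stand-in, the same three steps look like this in plain Python on a toy list-of-dicts dataset (the split names and message format here are simplified assumptions, not the real schema):

```python
import json

# Toy stand-in for the dataset's separate splits
splits = {
    "creative_writing": [
        {"messages": json.dumps([
            {"role": "system", "content": ""},
            {"role": "user", "content": "Write a haiku."},
        ])},
    ],
    "code": [
        {"messages": json.dumps([
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": "Reverse a list in Python."},
        ])},
    ],
}

# 1. Join the splits together and add a `split` column
rows = [dict(row, split=name) for name, items in splits.items() for row in items]

# 2. Convert string messages into lists of structs (dicts)
for row in rows:
    row["messages"] = json.loads(row["messages"])

# 3. Remove empty system prompts
for row in rows:
    row["messages"] = [
        m for m in row["messages"]
        if not (m["role"] == "system" and not m["content"].strip())
    ]
```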

https://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-dataset

Here's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned
posted an update 3 days ago
🔍 Meta teams use a fine-tuned Llama model to fix production issues in seconds

One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.

🤔 42%, is that not too low?
➡️ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
💡 So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!

How did they do it?

🔄 Two-step approach:
‣ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
‣ Fine-tuned Llama 2 7B ranks the most likely culprits
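A very rough sketch of that two-step funnel (all names and heuristics here are illustrative guesses, not Meta's actual code; `llm_score` is a hypothetical stand-in for the fine-tuned ranking model):

```python
def heuristic_filter(changes, incident, max_candidates=20):
    """Step 1: cheap signals (here, just code ownership) cut thousands of
    changes down to a manageable candidate set."""
    candidates = [
        change for change in changes
        if incident["service"] in change["owned_services"]
    ]
    return candidates[:max_candidates]

def rank_candidates(candidates, incident, llm_score):
    """Step 2: a fine-tuned model scores each candidate as the likely root
    cause; highest-scored candidates are surfaced first."""
    return sorted(candidates, key=lambda c: llm_score(c, incident), reverse=True)
```

The point of the split is cost: the cheap heuristics run over everything, while the expensive LLM only ever sees the 2-20 survivors, matching the constraint the training data mimicked.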

🎓 Training pipeline:
‣ Continued pre-training on Meta's internal docs and wikis
‣ Supervised fine-tuning on past incident investigations
‣ Training data mimicked real-world constraints (2-20 potential changes per incident)

🔮 Future developments to watch:
‣ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
‣ Improvements in model reasoning should boost accuracy further

Read it in full 👉 https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
upvoted an article 3 days ago
liked a Space 4 days ago
Reacted to reach-vb's post with 🔥 4 days ago
What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) code generation LLMs, with the 32B tackling giants like Gemini 1.5 Pro and Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B / 8B models approaching GPT-4o level; pick any LLM and train an adapter with Whisper as the audio encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3B by DeepSeek - Next iteration of their unified multimodal LLM Janus, now with Rectified Flow
deepseek-ai/JanusFlow-1.3B

Common Corpus by PleIAs - 2,003,039,184,047 multilingual, commercially permissive and high-quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! 🤗
upvoted an article 4 days ago
posted an update 4 days ago
Great feature alert: 𝗬𝗼𝘂 𝗰𝗮𝗻 𝗻𝗼𝘄 𝘂𝘀𝗲 𝗮𝗻𝘆 𝗦𝗽𝗮𝗰𝗲 𝗮𝘀 𝗮 𝘁𝗼𝗼𝗹 𝗳𝗼𝗿 𝘆𝗼𝘂𝗿 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀.𝗮𝗴𝗲𝗻𝘁! 🛠️🔥🔥

This lets you take the coolest spaces, like FLUX.1-dev, and use them in agentic workflows with a few lines of code! 🧑‍💻
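Following the linked docs, the setup looks roughly like this — a sketch assuming a recent `transformers` with the agents API; the Space id and tool description follow the FLUX.1-dev example, and running it requires a Hugging Face token and network access:

```python
from transformers import CodeAgent, HfApiEngine, Tool

# Import any Space as a tool (here, the FLUX.1-dev image generator)
image_generation_tool = Tool.from_space(
    "black-forest-labs/FLUX.1-dev",
    name="image_generator",
    description="Generate an image from a prompt",
)

# Use it in an agentic workflow
agent = CodeAgent(tools=[image_generation_tool], llm_engine=HfApiEngine())
agent.run("Make me a picture of me surfing a huge wave.")
```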

In the video below, I set up my fake vacation pictures where I'm awesome at surfing (I'm really not) 🏄

Head to the doc to learn this magic 👉 https://huggingface.co/docs/transformers/main/en/agents_advanced#import-a-space-as-a-tool-