@m-ric on Hugging Face: "𝗡𝗲𝘄 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 𝗿𝗮𝗻𝗸𝘀 𝗟𝗟𝗠𝘀 𝗳𝗼𝗿…"


𝗡𝗲𝘄 𝗹𝗲𝗮𝗱𝗲𝗿𝗯𝗼𝗮𝗿𝗱 𝗿𝗮𝗻𝗸𝘀 𝗟𝗟𝗠𝘀 𝗳𝗼𝗿 𝗟𝗟𝗠-𝗮𝘀-𝗮-𝗷𝘂𝗱𝗴𝗲: 𝗟𝗹𝗮𝗺𝗮-𝟯.𝟭-𝟳𝟬𝗕 𝘁𝗼𝗽𝘀 𝘁𝗵𝗲 𝗿𝗮𝗻𝗸𝗶𝗻𝗴𝘀! 🧑‍⚖️

Evaluating systems is critical during prototyping and in production, and LLM-as-a-judge has become a standard technique for doing so.

First, what is "LLM-as-a-judge"?
👉 It's a very useful technique for evaluating LLM outputs. If what you're evaluating can't be captured by deterministic criteria, like the "politeness" of an LLM output or how faithful it is to an original source, you can use an LLM-judge instead: prompt another LLM with "Here's an LLM output, please rate this on criterion {criterion} on a scale of 1 to 5", then parse the number from its output, and voilà, you get your score.
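To make the prompt-then-parse loop concrete, here is a minimal sketch. It assumes you already have some function that sends a prompt to a judge model and returns its text reply; `call_llm`, `build_judge_prompt`, `parse_score` and `judge` are hypothetical names for illustration, not part of any library.

```python
import re
from typing import Callable, Optional

# Prompt wording taken from the post; the "Output:" / "Rating:" framing is an assumption.
PROMPT_TEMPLATE = (
    "Here's an LLM output, please rate this on criterion {criterion} "
    "on a scale of 1 to 5.\n\nOutput:\n{output}\n\nRating:"
)


def build_judge_prompt(output: str, criterion: str) -> str:
    """Fill the judge prompt with the text to evaluate and the criterion."""
    return PROMPT_TEMPLATE.format(criterion=criterion, output=output)


def parse_score(judge_reply: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge(output: str, criterion: str, call_llm: Callable[[str], str]) -> Optional[int]:
    """Score `output` on `criterion` using a caller-supplied judge model.

    `call_llm` is a hypothetical callable: prompt in, raw text reply out.
    """
    reply = call_llm(build_judge_prompt(output, criterion))
    return parse_score(reply)


if __name__ == "__main__":
    # Stand-in judge that always answers the same thing, for demonstration only.
    fake_judge = lambda prompt: "Rating: 4/5"
    print(judge("Thanks a lot for your patience!", "politeness", fake_judge))
```

Note that the parser takes the first digit it finds, so a reply that restates the scale before giving its rating would be misread. Asking the judge to answer with the number only makes the parsing step more robust.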

🧐 But who judges the judge?
How can you make sure your LLM-judge is reliable? You can take a dataset annotated with scores from human judges, then measure how well the LLM-judge scores correlate with the human scores.
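As a rough sketch of that comparison, the snippet below computes the Pearson correlation between human scores and LLM-judge scores for the same items. Pearson is just one possible choice here (a rank correlation would also be reasonable), the `pearson` helper is written by hand for illustration, and the score lists are made-up placeholders, not real benchmark data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder scores for the same five items (illustrative only).
    human_scores = [1, 2, 3, 4, 5]
    judge_scores = [2, 2, 3, 5, 5]
    print(f"judge vs. human correlation: {pearson(human_scores, judge_scores):.2f}")
```

A value close to 1 means the judge ranks and spaces items much like the human annotators do; a value near 0 means its scores tell you little about what humans would say.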

📊 Before even running that benchmark, there's a new option to get you started: a leaderboard that measures how well different models perform as judges!

And the outcome is surprising: models come in quite a different order from what we're used to in general rankings. Probably some have much better bias mitigation than others!

Take a deeper look here 👉 https://huggingface.co/blog/arena-atla
