Which models do you want to see on here?
We started with the following models, as we've seen them most commonly used in eval pipelines:
- OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo)
- Anthropic (Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku)
- Meta (Llama 3.1 Instruct Turbo 405B / 70B / 8B)
- Alibaba (Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B)
- Google (Gemma 2 9B / 27B)
- Mistral (Instruct v0.3 7B, Instruct v0.1 7B)
What models would you be curious to see on here next?
Gemini models.
What about these models:
- microsoft ( Phi-3-medium-4k-instruct 14B )
- Alibaba ( Qwen 2.5 32B, 14B )
- Upstage ( solar-pro-preview-instruct 22B)
- Mistral ( Mistral-Large-Instruct-2407 123B )
(As a reference for which models to choose.) Other than some common benchmarks, here's one [benchmark] that is related to judging:
But how are the judging scores extracted - by a number, words, or something else? (see https://arxiv.org/abs/2305.14975)
Good shouts! I'm curious to see how those Qwen models would do given that the 2.5 7B is doing pretty well. And those benchmarks are very interesting; evaluating writing quality is a seriously tough task...
The judge score and critique are extracted from a JSON output, {"feedback": "<critique>", "result": <score>}, similar to the Lynx paper.
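For illustration, here's a minimal sketch of how a completion in that format could be parsed; the function name, the regex fallback, and the error handling are assumptions made for the example, not the arena's actual implementation:

```python
import json
import re


def parse_judge_output(raw: str):
    """Extract (critique, score) from a judge completion that was asked to
    reply with {"feedback": "<critique>", "result": <score>}."""
    # Judges sometimes wrap the JSON in prose or markdown fences, so take
    # the first {...} block rather than parsing the whole completion.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return raw, None  # no JSON found: keep the raw text, drop the score
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return raw, None
    return payload.get("feedback", ""), payload.get("result")


completion = '{"feedback": "Faithful to the provided context.", "result": 1}'
critique, score = parse_judge_output(completion)
print(score, critique)  # -> 1 Faithful to the provided context.
```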
https://www.flow-ai.com/judge? I believe it fits the criteria and seems like an interesting smaller competitor based on their pitch in the release blog.
hey! great initiative :) Would love to see a small model like Flow-Judge-v0.1 here! Happy to support with the integration if needed.
Will add Flow Judge in our next update, I'm super excited to see how a dedicated 3.8B model does.
Add Command-r and Command-r+, both old and new. They were the least positively biased in my experience.
What great work! We are looking forward to such an arena for judge models!
How about adding the CompassJudger series (https://github.com/open-compass/CompassJudger), which reached top performance among generative models on RewardBench (https://huggingface.co/spaces/allenai/reward-bench), JudgerBench (https://huggingface.co/spaces/opencompass/judgerbench_leaderboard), and JudgeBench (https://huggingface.co/spaces/ScalerLab/JudgeBench)?
It can also be applied to many subjective evaluation datasets as a judge model, for example in ArenaHard: https://github.com/lmarena/arena-hard-auto/issues/49