How do you get the reported Arc score of 85.8?

#3
by deleted - opened

Mistral is bragging that it beats GPT-3.5 across the board, including an ARC score of 85.8, yet all the available models are only achieving 66.

Can someone PLEASE explain why this discrepancy exists? Did they run the test using all 8 experts at once during inference, rather than 2, in order to hit 85.8 on the ARC test?

First, on the leaderboard I see 70.22, not 66.
Second, that ARC score of 85.8 is for the medium version, not the small one we've gotten so far.

@mirek190 Thanks for responding. But I just double-checked and confirmed that the ARC score of 85.8 is for the foundational 8x7B model (e.g. see the link to their website below). 70.22 is for the instruct model, 66 for the foundational one.

Plus all the other scores for 8x7B match perfectly (e.g. MMLU, HellaSwag and WinoGrande), and the as-yet-unreleased "medium" model reports scores that are about 3-5 points higher than these across the board.

The 85.8 ARC score needs to be independently verified ASAP. A claim of matching GPT-3.5's performance was ALWAYS about ARC. Mistral 7B already came within 3-5 points of matching GPT-3.5 on the other benchmarks (65 MMLU, 83 HellaSwag, 78 WinoGrande). Simply scaling Mistral 7B up to a dense Mistral 14B would have easily matched GPT-3.5's performance on everything but ARC. ARC requires "brains" (60 for Mistral 7B vs. 85 for GPT-3.5). On most tests, including ARC, Mixtral 8x7B isn't performing better than a dense Mistral 14B would have, aside from the multi-language upgrade (French, Spanish...).

https://mistral.ai/news/mixtral-of-experts/

Also, from their webpage, Llama 2 70B also has an ARC score of 85, so I suspect they are using a different set of ARC questions, because Llama-2-70B on the Hugging Face leaderboard has 67.32.
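For context on why the same model can post very different ARC numbers: evaluation harnesses typically score ARC by comparing the model's log-likelihood over the answer choices, and details like shot count, prompt format, and whether the log-likelihood is length-normalized (the leaderboard reports 25-shot acc_norm) all move the final score. A toy sketch with made-up log-likelihoods, just to illustrate how the two normalization schemes can disagree on a single question:

```python
# Toy sketch (hypothetical numbers): harness-style multiple-choice scoring.
# Each choice gets the summed log-likelihood the model assigns to its full
# answer string; "acc" picks on raw log-likelihood, "acc_norm" divides by
# the choice's length first so longer answers aren't penalized.
choices = {
    "A": {"loglik": -12.0, "length": 10},  # short answer, modest likelihood
    "B": {"loglik": -14.0, "length": 40},  # long answer, lower raw likelihood
}

# "acc": highest raw log-likelihood wins.
raw_pick = max(choices, key=lambda c: choices[c]["loglik"])

# "acc_norm": highest length-normalized log-likelihood wins.
norm_pick = max(choices, key=lambda c: choices[c]["loglik"] / choices[c]["length"])

print(raw_pick, norm_pick)  # prints "A B" - the two schemes disagree here
```

So before concluding Mistral used different questions, it's worth checking whether their reported number and the leaderboard's were even computed with the same shot count and metric.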

deleted

Thanks @mirek190. The fact that Llama 2 70B has an ARC score of 85 on their page is very relevant. I had just assumed it would match what Llama 2 70B got on HF, since I figured it was the same 25-shot test over the same 7.5k questions.

deleted changed discussion status to closed
