Please add Qwen-7b and Qwen-7b-chat to the LLM leaderboard.

#164
by vonjack - opened

They attribute the strong MMLU score relative to Llama-13B to the large amount and higher quality of training data (2.2T tokens), a better tokenizer (tiktoken), and a large vocabulary (151k tokens).

Yes, an interesting 7B model that performs on the level of 13B models.

Open LLM Leaderboard org

Hi @vonjack !
Feel free to submit these models if they are on the hub!

@clefourrier
I got this error: `Model "Qwen/Qwen-7B-Chat" needs to be launched with trust_remote_code=True. For safety reason, we do not allow these models to be automatically submitted to the leaderboard.`
I think it's because they use their own tokenizer. Is that why it's not allowed?
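For context, models on the Hub that ship custom modeling or tokenizer code declare it via an `auto_map` entry in their `config.json`, which is how loading them comes to require `trust_remote_code=True`. A minimal sketch of how a submission pipeline could detect this without executing any model code (the helper name and the config fragments are illustrative, not the real Qwen config):

```python
def needs_trust_remote_code(config: dict) -> bool:
    # Hub models that bundle their own modeling/tokenizer code list those
    # classes under "auto_map" in config.json; stock architectures do not.
    return "auto_map" in config

# Illustrative config fragments:
custom = {
    "model_type": "qwen",
    "auto_map": {"AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"},
}
stock = {"model_type": "llama"}

print(needs_trust_remote_code(custom))  # True
print(needs_trust_remote_code(stock))   # False
```

This only inspects metadata, so it is safe to run on untrusted repos; actually loading such a model still executes the bundled code, which is the safety concern here.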

Open LLM Leaderboard org

@felixz
Hi! Oh, I see!
We don't allow models which need `trust_remote_code=True` to be submitted automatically at the moment (for safety reasons, as we can't check the code of every model submitted), but it's on the todo list for September/October: we need to set up specific things on our cluster to implement that. Thank you for your patience, I'll be sure to keep everybody updated!

clefourrier changed discussion status to closed

> They claimed the great MMLU score compared to Llama-13b due to the huge amount and better quality of data for training (2.2T), the better tokenizer (tiktoken) and the huge word vocabulary (151k).

I would like to challenge that claim. Qwen (1.8B) seems to be on par with similarly sized models on other benchmarks such as HellaSwag, ARC-Challenge, or WinoGrande. The huge performance jump on MMLU is imho not explainable by the tokenizer or cleaner data. I tend to believe they trained on the benchmark data, maybe not even knowingly, e.g. on a Chinese translation of it.
