The CLEM Leaderboard aims to track, rank and evaluate current cLLMs (chat-optimized Large Language Models; suggested pronunciation: “clems”). The benchmarking approach is described in [Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents](https://aclanthology.org/2023.emnlp-main.689.pdf). The multimodal benchmark is described in [Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models](https://arxiv.org/abs/2406.14035). Source code for benchmarking "clems" is available here: [Clembench](https://github.com/clembench/clembench). All generated files and results from the benchmark runs are available here: [clembench-runs](https://github.com/clembench/clembench-runs).
The clemscore combines two components: a score for the overall ability to follow the game instructions at all (reported separately in the field Played) and the quality of play in those attempts where the instructions were followed (field Quality Scores). For details about the games / interaction settings, and for results on older versions of the benchmark, see the tab Versions and Details.
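
As a rough illustration of how the two fields could be combined, the sketch below assumes that the clemscore is the macro-averaged quality score scaled by the macro-averaged percentage of played games. This is an assumption based on the Clembench paper's description, not a copy of the benchmark's code; the function name `clemscore`, the dictionary layout, and the example numbers are all hypothetical.

```python
from statistics import mean


def clemscore(played_pct_by_game, quality_by_game):
    """Combine per-game Played (%) and Quality Scores into a single number.

    Assumption: clemscore = (macro-avg % played / 100) * macro-avg quality.
    Both arguments map a game name to a value on a 0-100 scale.
    """
    if played_pct_by_game.keys() != quality_by_game.keys():
        raise ValueError("both mappings must cover the same games")
    avg_played = mean(played_pct_by_game.values())    # macro-average over games
    avg_quality = mean(quality_by_game.values())      # macro-average over games
    return (avg_played / 100.0) * avg_quality


# Illustrative values only, not real leaderboard results.
played = {"taboo": 100.0, "wordle": 50.0}
quality = {"taboo": 80.0, "wordle": 40.0}
print(clemscore(played, quality))  # 45.0
```

Under this assumption, a model that plays well in the attempts it completes but often fails to follow the instructions is penalized in proportion to the share of attempts that did not count as played.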