CompassJudger Subjective Evaluation Leaderboard
Note By Shanghai AI Lab
VLMEvalKit Evaluation Results Collection
Note By OpenMMLab The OpenVLM Leaderboard evaluates and ranks 62 Vision-Language Models (VLMs) across 23 multi-modal benchmarks using VLMEvalKit, and includes only open-source models or models with publicly available APIs.
Note By BAAI. The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the FlagEval platform, which provides the computational resources and runtime environment. The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency.
Arena
Note By BAAI Featuring 50 popular closed-source models from China and beyond!
Note By Shanghai AI Lab An LLM leaderboard for Chinese models on many metric axes - super complete
Note By Tencent AI Text to video generation leaderboard
Realtime Image/Video Gen AI Arena
Note By Tiger Lab An arena for image generation!
Note By Alibaba - DAMO Southeast Asian (SEA) languages leaderboard
Note By Jina AI and BAAI A new benchmark focused on fair out-of-domain evaluation for RAG & NeuralIR
Leaderboard for LLM for Science Reasoning
Note By Tiger Lab Leaderboard for Science reasoning.
Note By Shanghai AI Lab Leaderboard for Video Generative Models.
JudgerBench Leaderboard
A Benchmark for Metamorphic Evaluation of T2V Generation
Note By PKU-Yuan group ChronoMagic-Bench is the first benchmark dedicated to assessing T2V models' ability to generate time-lapse videos that demonstrate significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities under free-form text control.
Note TempCompass is a benchmark to evaluate the temporal perception ability of Video LLMs.
Efficient Image/Video K-Sort Arena
Note K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Note Leaderboard for LLM Safety.