Claude-3.5 Evaluation Results on Open VLM Leaderboard

Community Article Published June 24, 2024


Claude3.5-Sonnet is the latest large multi-modal model released by Anthropic and the first release in the Claude 3.5 series. According to the official blog, it surpasses predecessors such as Claude3-Opus, as well as Gemini-1.5-Pro, in multi-modal understanding. To verify this, we tested Claude3.5-Sonnet on eight objective image-text multimodal evaluation benchmarks from the Open VLM Leaderboard.

| Dataset \ Model | GPT-4o-20240513 | Claude3.5-Sonnet | Gemini-1.5-Pro | GPT-4v-20240409 | Claude3-Opus |
| --- | --- | --- | --- | --- | --- |
| Overall Rank | 1 | 2 | 3 | 4 | 16 |
| Avg. Score | 69.9 | 67.9 | 64.4 | 63.5 | 54.4 |
| MMBench v1.1 | 82.2 | 78.5 | 73.9 | 79.8 | 59.1 |
| MMStar | 63.9 | 62.2 | 59.1 | 56.0 | 45.7 |
| MMMU_VAL | 69.2 | 65.9 | 60.6 | 61.7 | 54.9 |
| MathVista_MINI | 61.3 | 61.6 | 57.7 | 54.7 | 45.8 |
| HallusionBench Avg. | 55.0 | 49.9 | 45.6 | 43.9 | 37.8 |
| AI2D_TEST | 84.6 | 80.2 | 79.1 | 78.6 | 70.6 |
| OCRBench | 736 | 788 | 754 | 656 | 694 |
| MMVet | 69.1 | 66.0 | 64.0 | 67.5 | 51.7 |
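The Avg. Score row can be reproduced from the per-benchmark numbers above. A small sketch, under one assumption of ours (not stated in this article): OCRBench is scored out of 1000, so it is rescaled to 0-100 before averaging. The `avg_score` helper and the rescaling rule are illustrative, not the leaderboard's actual code, though the result matches the published averages.

```python
def avg_score(scores: dict) -> float:
    """Mean over the eight benchmarks, with OCRBench rescaled to 0-100.

    Assumption: OCRBench uses a 0-1000 scale and is divided by 10;
    all other benchmarks are already on a 0-100 scale.
    """
    vals = [v / 10 if name == "OCRBench" else v for name, v in scores.items()]
    return round(sum(vals) / len(vals), 1)

# Claude3.5-Sonnet's per-benchmark scores from the table above.
claude35_sonnet = {
    "MMBench v1.1": 78.5, "MMStar": 62.2, "MMMU_VAL": 65.9,
    "MathVista_MINI": 61.6, "HallusionBench Avg.": 49.9,
    "AI2D_TEST": 80.2, "OCRBench": 788, "MMVet": 66.0,
}

print(avg_score(claude35_sonnet))  # 67.9, matching the published Avg. Score
```

The same computation applied to the GPT-4o-20240513 column yields 69.9, so this normalization appears consistent with how the leaderboard aggregates scores.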

The evaluation results show that Claude3.5-Sonnet's objective performance has improved substantially over Claude3-Opus: its average score across all benchmarks rose by 13.5 points (from 54.4 to 67.9), and its overall ranking climbed from 16th to 2nd. Specifically, Claude3.5-Sonnet placed in the top two on six of the eight benchmarks, and achieved the best results in multimodal mathematics (MathVista) and optical character recognition (OCRBench).

Potential issues: API models such as GPT-4o and Claude3.5-Sonnet are released with officially reported results on several multimodal evaluation benchmarks. Since the official test scripts are not public, we were unable to reproduce some of the reported accuracies (e.g., on AI2D). If you can reproduce significantly higher accuracy on any benchmark, please contact us so we can update the results: [email protected].

For more detailed performance, please refer to the Open VLM Leaderboard.