fromplutowithlove committed on
Commit 996b6a9
1 Parent(s): ee47981

Fix spelling error in README

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -59,7 +59,7 @@ Below are the comparison results on existing multi-image benchmarks. On average,
 
  **BLINK**: a benchmark with 14 visual tasks that humans can solve very quickly but are still hard for current multimodal LLMs.
 
- | Benchmark | Phi-3.5-vision-instrust | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
+ | Benchmark | Phi-3.5-vision-instruct | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
  |--|--|--|--|--|--|--|--|--|--|
  | Art Style | 87.2 | 62.4 | 55.6 | 52.1 | 64.1 | 70.1 | 59.8 | 70.9 | 73.3 |
  | Counting | 54.2 | 56.7 | 54.2 | 66.7 | 51.7 | 55.0 | 59.2 | 65.0 | 65.0 |
@@ -79,7 +79,7 @@ Below are the comparison results on existing multi-image benchmarks. On average,
 
  **Video-MME**: comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.
 
- | Benchmark | Phi-3.5-vision-instrust | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
+ | Benchmark | Phi-3.5-vision-instruct | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
  |--|--|--|--|--|--|--|--|--|--|
  | short (<2min) | 60.8 | 62.3 | 60.7 | 61.7 | 72.2 | 70.1 | 66.3 | 73.3 | 77.7 |
  | medium (4-15min) | 47.7 | 47.1 | 46.4 | 49.6 | 62.7 | 59.6 | 54.7 | 61.2 | 68.0 |