microsoft
/

Phi-3.5-vision-instruct

Image-Text-to-Text

text-generation

Model card Files Files and versions Community

fromplutowithlove commited on Aug 23

Commit

996b6a9

•

1 Parent(s): ee47981

Fix spelling error in README

Files changed (1) hide show

README.md +2 -2

README.md CHANGED Viewed

@@ -59,7 +59,7 @@ Below are the comparison results on existing multi-image benchmarks. On average,
 **BLINK**: a benchmark with 14 visual tasks that humans can solve very quickly but are still hard for current multimodal LLMs.
-| Benchmark | Phi-3.5-vision-instrust | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
 |--|--|--|--|--|--|--|--|--|--|
 | Art Style | 87.2 | 62.4 | 55.6 | 52.1 | 64.1 | 70.1 | 59.8 | 70.9 | 73.3 |
 | Counting | 54.2 | 56.7 | 54.2 | 66.7 | 51.7 | 55.0 | 59.2 | 65.0 | 65.0 |
@@ -79,7 +79,7 @@ Below are the comparison results on existing multi-image benchmarks. On average,
 **Video-MME**: comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.
-| Benchmark | Phi-3.5-vision-instrust | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
 |--|--|--|--|--|--|--|--|--|--|
 | short (<2min) | 60.8 | 62.3 | 60.7 | 61.7 | 72.2 | 70.1 | 66.3 | 73.3 | 77.7 |
 | medium (4-15min) | 47.7 | 47.1 | 46.4 | 49.6 | 62.7 | 59.6 | 54.7 | 61.2 | 68.0 |

 **BLINK**: a benchmark with 14 visual tasks that humans can solve very quickly but are still hard for current multimodal LLMs.
+| Benchmark | Phi-3.5-vision-instruct | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
 |--|--|--|--|--|--|--|--|--|--|
 | Art Style | 87.2 | 62.4 | 55.6 | 52.1 | 64.1 | 70.1 | 59.8 | 70.9 | 73.3 |
 | Counting | 54.2 | 56.7 | 54.2 | 66.7 | 51.7 | 55.0 | 59.2 | 65.0 | 65.0 |
 **Video-MME**: comprehensively assess the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.
+| Benchmark | Phi-3.5-vision-instruct | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
 |--|--|--|--|--|--|--|--|--|--|
 | short (<2min) | 60.8 | 62.3 | 60.7 | 61.7 | 72.2 | 70.1 | 66.3 | 73.3 | 77.7 |
 | medium (4-15min) | 47.7 | 47.1 | 46.4 | 49.6 | 62.7 | 59.6 | 54.7 | 61.2 | 68.0 |