---
license: apache-2.0
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---
# Omnivision

## Introduction

Omnivision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs, optimized for edge devices. Improving on LLaVA's architecture, it features:

- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
- **Trustworthy Results**: Reduces hallucinations through **DPO** training on trustworthy data.
  
**Quick Links:**
1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
2. [Quickstart for local setup](#how-to-use-on-device)
3. Learn more in our [Blogs](https://nexa.ai/blogs/omni-vision)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases
Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it ideal for on-device applications.

**Example Demo:**
Omnivision-generated caption for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro | FP16 requires 988 MB RAM and 948 MB storage.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/PTG3_n_p7_atBHCwRLOEE.png" alt="Example" style="width:700px;"/>


## Benchmarks

The figure below shows how Omnivision performs against nanoLLAVA, previously the world's smallest vision-language model. Omnivision outperforms it on all tasks.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>

We conducted a series of experiments on benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE, to evaluate Omnivision's performance.

| Benchmark         | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET            | 27.5                | 23.9      | 49.5        |
| ChartQA (Test)    | 59.2                | NA        | 73.5        |
| MMMU (Test)       | 41.8                | 28.6      | 41.1        |
| MMMU (Eval)       | 39.9                | 30.4      | 41.1        |
| ScienceQA (Eval)  | 62.2                | 59.0      | NA          |
| ScienceQA (Test)  | 64.5                | 59.0      | NA          |
| POPE              | 89.4                | 84.1      | NA          |


## How to Use On Device
In the following, we demonstrate how to run Omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via a Python package or an executable installer.

**Step 2: Run the following command in your terminal**

```bash
nexa run omnivision 
```

## Model Architecture
Omnivision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x.

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
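
To make the token reduction concrete: the encoder produces 729 patch embeddings per image, and the projector maps them to 81 tokens in the language model's embedding space. The sketch below is a minimal, hypothetical illustration of one way such a projector could work, grouping each 3×3 neighborhood of patch embeddings into a single projected token; the module name, the grouping mechanism, and the hidden sizes (1152 for SigLIP, 896 for Qwen2.5-0.5B) are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class TokenReducingProjector(nn.Module):
    """Illustrative projector: merges each 3x3 patch neighborhood into one token.

    Hypothetical sketch -- the actual Omnivision projector may differ.
    """
    def __init__(self, vision_dim: int = 1152, text_dim: int = 896, group: int = 3):
        super().__init__()
        self.group = group
        # Each output token is built from group*group neighboring patch embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, 729, vision_dim) from the vision encoder (27x27 grid)
        b, n, d = patches.shape
        side = int(n ** 0.5)                      # 27
        g = self.group                            # 3
        x = patches.view(b, side, side, d)
        # Regroup into (side/g) x (side/g) blocks of g*g patches each.
        x = x.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)                        # (batch, 81, text_dim)

projector = TokenReducingProjector()
image_tokens = projector(torch.randn(1, 729, 1152))
print(image_tokens.shape)  # torch.Size([1, 81, 896])
```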

## Training

We developed Omnivision through a three-stage training pipeline:

**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
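
As a minimal sketch of this stage-one setup, assuming a wrapper module that exposes the projection layer as a `projector` attribute (a hypothetical name), freezing everything except the projector could look like this:

```python
import torch

def freeze_for_pretraining(model: torch.nn.Module) -> None:
    """Stage 1: train only the projection layer on image-caption pairs."""
    for param in model.parameters():
        param.requires_grad = False          # freeze the vision encoder and LLM ...
    for param in model.projector.parameters():
        param.requires_grad = True           # ... and unfreeze only the projector

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable:,}")
```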

**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model learns to generate more contextually appropriate responses.
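
A hedged illustration of what one such training record might look like; the field names and the `<image>` placeholder are illustrative assumptions rather than the team's actual data schema:

```python
# Hypothetical SFT record: a chat-formatted history pairing an image with Q&A turns.
sft_example = {
    "image": "chart.png",
    "conversations": [
        {"role": "user",      "content": "<image>\nWhat does this chart show?"},
        {"role": "assistant", "content": "Quarterly revenue growth from 2021 to 2023."},
        {"role": "user",      "content": "Which quarter had the highest revenue?"},
        {"role": "assistant", "content": "Q4 2023."},
    ],
}
```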

**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. The fine-tuning targets essential improvements in model output without altering the model's core response characteristics.
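
Below is a small, hypothetical sketch of how such chosen-rejected pairs could be assembled; the record layout mirrors common preference-data formats and is an assumption, not Omnivision's actual training schema:

```python
import json

def build_dpo_record(image_path: str, prompt: str,
                     base_response: str, teacher_correction: str) -> dict:
    """Pair the base model's output (rejected) with the teacher's
    minimally edited correction (chosen) for DPO fine-tuning."""
    return {
        "image": image_path,
        "prompt": prompt,
        "chosen": teacher_correction,   # accuracy-corrected response
        "rejected": base_response,      # original, possibly hallucinated response
    }

record = build_dpo_record(
    image_path="poster.png",
    prompt="How many people are in the image?",
    base_response="There are three people standing by the door.",
    teacher_correction="There are two people standing by the door.",
)
print(json.dumps(record, indent=2))
```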

## What's next for Omnivision?
Omnivision is in early development, and we are working to address its current limitations:
- Expand DPO Training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
- Improve document and text understanding.
  
In the long term, we aim to develop Omnivision as a fully optimized, production-ready solution for edge AI multimodal applications.

### Follow us
[Blogs](https://nexa.ai/blogs/omni-vision) | [Discord](https://discord.gg/nexa-ai) | [X (Twitter)](https://x.com/alanzhuly)