---
license: apache-2.0
---

# OmniFusion

**OmniFusion** is an advanced multimodal AI model designed to extend traditional language processing systems by integrating additional data modalities such as images, and potentially audio, 3D, and video content.

### Architecture

<p align="left">
<img src="https://raw.githubusercontent.com/AIRI-Institute/OmniFusion/main/content/architecture.png" width="70%">
</p>

The core of the open-source OmniFusion version is Mistral-7B. Focusing initially on images, we selected CLIP-ViT-L as the visual encoder for its efficient information transfer capabilities. The most important component of OmniFusion is its adapter, the mechanism that allows the language model to interpret and incorporate information from other modalities. The adapter is a single-layer, four-headed transformer, which has shown superior performance compared to simpler linear layers or MLP structures.

This adapter takes embeddings from the visual encoder (excluding the CLS token) and maps them into textual embeddings compatible with the language model.

To further enhance the model's multimodal capabilities, we employ trainable special tokens to mark the beginning and end of visual data within the text sequence.
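
The adapter described above can be sketched roughly as follows. This is a minimal, illustrative PyTorch snippet, not the project's actual code: the class name `VisualAdapter`, the embedding widths (1024 for CLIP-ViT-L patches, 4096 for Mistral-7B), and the exact layer layout are assumptions made for the example.

```python
import torch
import torch.nn as nn


class VisualAdapter(nn.Module):
    """Illustrative sketch: a single-layer, four-headed transformer that maps
    visual-encoder patch embeddings into the language model's embedding space."""

    def __init__(self, vis_dim: int = 1024, txt_dim: int = 4096, num_heads: int = 4):
        super().__init__()
        # One transformer layer over the visual tokens (the CLS token is already removed).
        self.block = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=num_heads, batch_first=True)
        # Projection from the visual width to the language model's embedding width.
        self.proj = nn.Linear(vis_dim, txt_dim)
        # Trainable special-token embeddings marking the start and end of the visual span.
        self.img_start = nn.Parameter(torch.randn(1, 1, txt_dim) * 0.02)
        self.img_end = nn.Parameter(torch.randn(1, 1, txt_dim) * 0.02)

    def forward(self, visual_embeds: torch.Tensor) -> torch.Tensor:
        # visual_embeds: (batch, num_patches, vis_dim), CLS token excluded.
        x = self.proj(self.block(visual_embeds))  # (batch, num_patches, txt_dim)
        b = x.size(0)
        start = self.img_start.expand(b, -1, -1)
        end = self.img_end.expand(b, -1, -1)
        # Wrap the projected visual tokens with the begin/end markers before they are
        # spliced into the sequence of text embeddings fed to the language model.
        return torch.cat([start, x, end], dim=1)
```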

### Training Process

Training proceeds in two stages:

1. Pre-training the adapter on Image Captioning tasks (LAION, CC-4M).
2. Once the adapter has learned to map ViT's visual embeddings into the language model's textual space, we unfreeze Mistral to improve its understanding of dialog formats and complex queries (see the sketch below).
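
Schematically, the two stages differ mainly in which parameters are trainable. The snippet below is a simplified, hypothetical sketch of that schedule; the function name, learning rates, and optimizer choice are placeholders rather than the project's actual training configuration.

```python
import torch


def configure_stage(stage: int, language_model, visual_encoder, adapter):
    """Illustrative sketch of the two-stage schedule: returns an optimizer over
    the parameters that are trainable in the given stage (1 or 2)."""
    visual_encoder.requires_grad_(False)       # the visual encoder stays frozen throughout
    adapter.requires_grad_(True)               # the adapter trains in both stages
    language_model.requires_grad_(stage == 2)  # Mistral is unfrozen only in stage 2

    trainable = [p for module in (adapter, language_model)
                 for p in module.parameters() if p.requires_grad]
    lr = 1e-4 if stage == 1 else 2e-5          # placeholder learning rates
    return torch.optim.AdamW(trainable, lr=lr)

# Stage 1: adapter pre-training on image captioning data (LAION, CC-4M).
# Stage 2: joint tuning on dialog-style data with the language model unfrozen.
```

Freezing the language model in the first stage lets the adapter converge on the captioning objective before Mistral itself is adjusted for dialog formats and complex queries.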

<p align="left">
<img src="https://raw.githubusercontent.com/AIRI-Institute/OmniFusion/main/content/datasets.png" width="80%">
</p>

### Results

OmniFusion was benchmarked against the latest multimodal SOTA models. It excelled in generative metrics and in classification benchmarks such as VisualDialog.
<p align="left">
<img src="https://raw.githubusercontent.com/AIRI-Institute/OmniFusion/main/content/radar.png" width="50%">
</p>

Model performance on the Visual Dialog benchmark:

| Model      | NDCG  | MRR   | Recall@1 | Recall@5 | Recall@10 |
|------------|-------|-------|----------|----------|-----------|
| OmniFusion | 25.91 | 10.78 | 4.74     | 13.80    | 20.53     |
| LLaVA-13B  | 24.74 | 8.91  | 2.98     | 10.80    | 18.02     |
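
For reference, the ranking metrics in the table can be read as follows: Recall@k is the fraction of questions for which the ground-truth answer appears among the top-k ranked candidates, MRR averages the reciprocal rank of the ground-truth answer, and NDCG additionally weights candidates by dense relevance annotations. A tiny illustrative sketch (not the official Visual Dialog evaluation code):

```python
def recall_at_k(gt_ranks: list[int], k: int) -> float:
    """Fraction of questions whose ground-truth answer is ranked within the top k (1-based ranks)."""
    return sum(rank <= k for rank in gt_ranks) / len(gt_ranks)


def mean_reciprocal_rank(gt_ranks: list[int]) -> float:
    """Mean of 1/rank of the ground-truth answer over all questions."""
    return sum(1.0 / rank for rank in gt_ranks) / len(gt_ranks)


# Example: ground-truth answer ranks for three questions.
ranks = [1, 4, 12]
print(recall_at_k(ranks, 5))        # 0.666...
print(mean_reciprocal_rank(ranks))  # ~0.444
```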

### Examples

<p align="left">
<img src="https://raw.githubusercontent.com/AIRI-Institute/OmniFusion/main/content/examples.png" width="100%">
</p>

### Future Plans

Work is underway on a version that understands Russian, uses ImageBind encoders, and accepts more modalities (sound, 3D, video). Stay tuned for updates on GitHub!

### Authors

The FusionBrain scientific group from the AIRI Institute, in collaboration with scientists from Sber AI, led the model's development.

Main contributors:

+ Anton Razzhigaev: [Blog](https://t.me/abstractDL)
+ Elizaveta Goncharova
+ Matvey Mihkalchuk
+ Maxim Kurkin
+ Irina Abdullaeva
+ Denis Dimitrov: [Blog](https://t.me/dendi_math_ai)
+ Andrey Kuznetsov: [Blog](https://t.me/complete_ai)