---
license: apache-2.0
---

# Model Card


Veagle is a multimodal model that improves the textual understanding and interpretation of images. Its distinguishing
feature is an architecture that combines several existing components: a vision abstractor from mPLUG-Owl,
the Q-Former from InstructBLIP, and the Mistral language model. This combination lets Veagle better capture the
connection between text and images, achieving state-of-the-art results. Veagle starts from a pre-trained
vision encoder and language model and is trained in two stages, which helps the model use image and text
information together effectively.

Further details about Veagle are available in this blog post: https://superagi.com/superagi-veagle/

arXiv paper: https://arxiv.org/abs/2403.08773

## Key Contributions

- Veagle surpasses most state-of-the-art (SOTA) models on major benchmarks and outperforms competitors
  across a variety of tasks and domains.
- Veagle achieves high accuracy and efficiency using an optimized dataset, which demonstrates that the model
  learns effectively from limited data. We curated a dataset of 3.5 million examples, tailored specifically to
  enhance visual representation learning.
- Veagle's architecture blends a vision abstractor inspired by mPLUG-Owl, the Q-Former module from InstructBLIP,
  and the Mistral language model. This architecture, complemented by an additional projection layer and
  architectural refinements, enables Veagle to excel in multimodal tasks.
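
To make the component list above more concrete, the following is a minimal shape-tracing sketch of how such a pipeline could be chained. It assumes the components are applied in the order the card names them (vision encoder, vision abstractor, Q-Former, projection layer, then the language model); the card does not state this order explicitly. Every size except the language model width is a hypothetical placeholder, and `Stage`, `PIPELINE`, and `trace` are illustrative names, not part of the Veagle codebase.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class Stage:
    """One step of the visual pipeline, described only by its output shape."""
    name: str
    out_shape: Callable[[Shape], Shape]


# Hypothetical sizes, chosen only to make the trace concrete.
NUM_PATCHES = 256            # patch tokens emitted by the vision encoder
VISION_DIM = 1024            # vision encoder feature width
NUM_ABSTRACTOR_TOKENS = 64   # tokens kept by the vision abstractor
NUM_QUERY_TOKENS = 32        # learned queries in the Q-Former
QFORMER_DIM = 768            # Q-Former hidden width
LLM_DIM = 4096               # hidden width of Mistral 7B

# Assumed ordering: follows the order in which the card lists the components.
PIPELINE: List[Stage] = [
    Stage("vision encoder", lambda s: (s[0], NUM_PATCHES, VISION_DIM)),
    Stage("vision abstractor", lambda s: (s[0], NUM_ABSTRACTOR_TOKENS, s[2])),
    Stage("Q-Former", lambda s: (s[0], NUM_QUERY_TOKENS, QFORMER_DIM)),
    Stage("projection layer", lambda s: (s[0], s[1], LLM_DIM)),
]


def trace(image_shape: Shape) -> List[Tuple[str, Shape]]:
    """Return the tensor shape after each stage, starting from the raw image."""
    shapes = [("image", image_shape)]
    shape = image_shape
    for stage in PIPELINE:
        shape = stage.out_shape(shape)
        shapes.append((stage.name, shape))
    return shapes


if __name__ == "__main__":
    # A batch of two RGB images at a hypothetical 224x224 resolution.
    for name, shape in trace((2, 3, 224, 224)):
        print(f"{name:>18}: {shape}")
```

The point of the projection layer in this sketch is that the visual tokens end up with the same width as the language model's embeddings, so they can be placed alongside text tokens in its input sequence.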


## Training

- Trained by: SuperAGI Team
- Hardware: 8 x NVIDIA A100 SXM (80GB)
- LLM: Mistral 7B
- Vision Encoder: mPLUG-Owl2
- Duration of pretraining: 12 hours
- Duration of finetuning: 25 hours
- Number of epochs in pretraining: 3
- Number of epochs in finetuning: 2
- Batch size in pretraining: 8
- Batch size in finetuning: 10
- Learning Rate: 1e-5
- Weight Decay: 0.05
- Optimizer: AdamW
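
For readers who want these settings in machine-readable form, here is a minimal sketch that records the values listed above as plain Python dataclasses. The class and field names are hypothetical and do not come from the Veagle repository. Two assumptions are made: that the single learning rate and weight decay apply to both stages, and that the reported durations are wall-clock hours on all 8 GPUs. The card also does not say whether the batch sizes are per device or global.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Settings the card reports separately for each training stage."""
    name: str
    epochs: int
    batch_size: int       # per-device vs. global is not stated in the card
    duration_hours: int   # assumed to be wall-clock time on the full node


@dataclass(frozen=True)
class OptimizerConfig:
    """Optimizer settings; assumed to be shared by both stages."""
    name: str = "AdamW"
    learning_rate: float = 1e-5
    weight_decay: float = 0.05


NUM_GPUS = 8  # 8 x NVIDIA A100 (80GB), as listed above

PRETRAINING = StageConfig("pretraining", epochs=3, batch_size=8, duration_hours=12)
FINETUNING = StageConfig("finetuning", epochs=2, batch_size=10, duration_hours=25)
OPTIMIZER = OptimizerConfig()


def gpu_hours(stage: StageConfig, num_gpus: int = NUM_GPUS) -> int:
    """GPU-hours for a stage, under the wall-clock assumption stated above."""
    return stage.duration_hours * num_gpus


if __name__ == "__main__":
    for stage in (PRETRAINING, FINETUNING):
        print(f"{stage.name}: {stage.epochs} epochs, {gpu_hours(stage)} GPU-hours")
    print(f"total: {gpu_hours(PRETRAINING) + gpu_hours(FINETUNING)} GPU-hours")
```

Under those assumptions the two stages come to 96 and 200 GPU-hours respectively, or 296 GPU-hours in total.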

## Steps to try
  ```bash
  # 1. Clone the repository
  git clone https://github.com/superagi/Veagle
  cd Veagle
  ```

  ```bash
  # 2. Run the installation script
  # (the activate step expects a virtual environment at venv/)
  source venv/bin/activate
  chmod +x install.sh
  ./install.sh
  ```

  ```bash
  # 3. Ask a question about an image
  python evaluate.py --answer_qs \
    --model_name veagle_mistral \
    --img_path images/food.jpeg \
    --question "Is the food in the image healthy or not?"
  ```
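
To ask several questions without retyping the command, the evaluation call can be wrapped in a small helper. This is a sketch only: it uses just the flags shown in the command above, and `build_eval_command` and `run_eval` are hypothetical names, not functions from the Veagle repository. It assumes the script is run from inside the cloned `Veagle` directory with the environment already activated.

```python
import shlex
import subprocess
from typing import List


def build_eval_command(
    img_path: str,
    question: str,
    model_name: str = "veagle_mistral",
) -> List[str]:
    """Build the argument list for evaluate.py using only the flags shown above."""
    return [
        "python", "evaluate.py", "--answer_qs",
        "--model_name", model_name,
        "--img_path", img_path,
        "--question", question,
    ]


def run_eval(img_path: str, question: str) -> None:
    """Run evaluate.py; raises CalledProcessError if the script exits non-zero."""
    # Passing a list (not a shell string) avoids quoting problems in the question.
    subprocess.run(build_eval_command(img_path, question), check=True)


if __name__ == "__main__":
    cmd = build_eval_command(
        "images/food.jpeg",
        "Is the food in the image healthy or not?",
    )
    # Print the shell-quoted command so it can be inspected before running it.
    print(shlex.join(cmd))
```

Calling `run_eval("images/food.jpeg", "What dish is shown?")` would then launch the script with a different question, assuming `evaluate.py` accepts free-form questions as the example suggests.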

## Evaluation

![Veagle evaluation results](https://cdn-uploads.huggingface.co/production/uploads/65a8fe900dba6b99a0164a47/bBBFaYI6maW_DKci9nl6L.jpeg)


## The SuperAGI team

Rajat Chawla, Arkajit Dutta, Tushar Verma, Adarsh Jha, Anmol Gautam, Ayush Vatsal,
Sukrit Chatterjee, Mukunda NS, Ishaan Bhola