
Github | Inference Notebook | Dataset | Model Family

Model Details

We have developed and released the Vista 7B family, which includes a pretrained projector and a finetuned Vietnamese Vision Language Model (VLM). The models are optimized for image description tasks.

We extend Vistral 7B with vision capabilities using the LLaVA approach, training on our proprietary Vista dataset with SigLIP as the image encoder.
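As a rough illustration of the LLaVA-style data flow described above, the sketch below projects image patch features into the language model's embedding space and splices them into the text token sequence. It is a toy under stated assumptions, not the Vista implementation: the single linear projector, the tiny dimensions, and the names `project_patches` and `splice_image_embeddings` are all hypothetical.

```python
from typing import List

Vector = List[float]


def project_patches(patches: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Apply a linear projector, producing one output vector per image patch.

    weight has shape (llm_dim, vision_dim); bias has shape (llm_dim,).
    """
    projected = []
    for patch in patches:
        row = [
            sum(w * x for w, x in zip(weight_row, patch)) + b
            for weight_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected


def splice_image_embeddings(
    text_embeds: List[Vector], image_embeds: List[Vector], image_position: int
) -> List[Vector]:
    """Replace the placeholder at image_position with the projected patches."""
    return text_embeds[:image_position] + image_embeds + text_embeds[image_position + 1:]


# Toy sizes (assumptions): 3 patches, vision_dim = 2, llm_dim = 3.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape (llm_dim, vision_dim)
bias = [0.0, 0.0, 0.5]

image_embeds = project_patches(patches, weight, bias)

# Four text positions; index 1 stands in for a hypothetical image placeholder token.
text_embeds = [[0.1, 0.1, 0.1], [0.0, 0.0, 0.0], [0.2, 0.2, 0.2], [0.3, 0.3, 0.3]]
sequence = splice_image_embeddings(text_embeds, image_embeds, image_position=1)

print(len(sequence))  # 6: three text positions plus three image patches
```

In this arrangement the language model sees the image as a run of extra positions in its input sequence, which is why a pretrained projector can be trained separately before the language model is finetuned.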

Disclaimer: The model has not been trained on OCR tasks and may perform poorly on OCR and graph analysis. Use it with caution, as we have not focused on correcting the model's factual knowledge.

Model developers: Vi-VLM

Input: The models take text and an image as input.

Output: The models generate image descriptions only.

Model architecture: Mistral.

Intended Use

Intended use cases: Vista is primarily intended for research applications within the Vietnamese context. This version aims to further improve the Vietnamese Vision Language Model capabilities.

Out of scope: The use of Vista in any manner that violates applicable laws or regulations is strictly prohibited.

How to use

Use with Kaggle Notebook

To run inference using the model, follow the steps outlined in our Kaggle Inference Notebook.

Training process

Training Metrics Image: A snapshot of the training metrics is shown below.

[Image: Training Metrics]

Weights & Biases: Monitor the training progress and access additional analytics at our WandB project page.

Training Data

Pretrained Model:

  • Dataset: ShareGPT4V and a subset of WIT from the Vista dataset.

Finetuned Model:

  • Tasks:
    • Conversation
    • Complex reasoning
    • Detailed description
  • Dataset: Subset from the Vista dataset.

Hardware

GPU Configuration: Cluster of 2x NVIDIA A100-SXM4-40GB, provided by Google Cloud Research and VietAI.

GPU Usage:

  • Pretrain: 4 hours of GPU time.
  • Finetune: 14 hours of GPU time.

Training Arguments

| Parameter | Pretrain | Finetune (LoRA) |
|---|---|---|
| Epoch | 1 | 1 |
| Global batch size | 16 | 16 |
| Learning Scheduler | cosine with warmup | cosine with warmup |
| Optimizer | AdamW | AdamW |
| Warmup Ratio | 0.03 | 0.03 |
| Weight Decay | 0.00 | 0.00 |
| Learning rate (LLM) | - | 1.25e-5 |
| Learning rate (Projector) | 1e-3 | 1.25e-6 |
| rank | - | 128 |
| alpha | - | 256 |
| Target modules | - | all linear layers |
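
The schedule and LoRA settings listed above can be made concrete with a little arithmetic. The sketch below assumes the common form of "cosine with warmup" (linear warmup from zero, then cosine decay to zero) and the standard LoRA scaling factor alpha / rank. The source does not state the exact scheduler implementation or the number of training steps, so `total_steps = 1000` and the function names are hypothetical.

```python
import math


def cosine_with_warmup(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor by which the low-rank update is scaled in standard LoRA."""
    return alpha / rank


total_steps = 1000   # hypothetical; not given in the model card
peak_lr = 1e-3       # projector learning rate during pretraining (from the table)
warmup_ratio = 0.03  # from the table

# Print the learning rate at the start, mid-warmup, peak, mid-decay, and end.
for step in (0, 15, 30, 515, 1000):
    lr = cosine_with_warmup(step, total_steps, peak_lr, warmup_ratio)
    print(f"step {step:4d}: lr = {lr:.6f}")

print(lora_scaling(alpha=256, rank=128))  # 2.0
```

With a warmup ratio of 0.03, only the first 3% of steps are spent ramping up, so almost all of training runs on the decaying part of the curve.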

Examples

[Three example images]

Responsibility & Safety

We are committed to promoting an open approach to the development of Vietnamese AI, believing that it fosters better and faster innovation. This initiative is designed to bolster the efforts of the Vietnamese AI community.

The Vista model is built for versatility across a broad spectrum of applications. However, it is important to note that it is not tailored to meet every specific developer preference for all conceivable use cases out-of-the-box. Such preferences are inherently diverse and vary significantly across different applications.

Ethical Considerations and Limitations

The responses from this model are not intended to offend or insult any individual or organization. The answers it provides should be treated as reference material only, and users should critically assess their accuracy.

The model still has significant limitations in both its knowledge and its performance on practical tasks.

Future Work

We are committed to continuous improvement of the model, with specific plans to:

  1. Further train the finetuned model on diverse Vision Language tasks to enhance its performance.
  2. Improve the factual knowledge of the model, particularly to better adapt to Vietnamese cultural contexts.
  3. Investigate the combination of different vision encoders to capture more comprehensive image features.

Acknowledgement

We express our deep gratitude to various contributors and supporters of our project:

  • [LLaVA]: Significant portions of the source code and instructions were utilized from the LLaVA repository, with modifications to adapt to our model architecture.

  • [Vistral]: Immense thanks to the Vistral development team for creating an outstanding LLM for Vietnamese, accessible at Hugging Face - Vistral-7B-Chat.

  • [SigLIP]: We are grateful for the innovative multilingual vision encoder developed by the SigLIP team, detailed in their research paper.

  • Sponsors: Special thanks to [VietAI] and [Google Cloud Research] for their diamond-level sponsorship, providing the computing resources essential for our project.

  • Mentors: Our heartfelt appreciation goes to our mentors, Anh Duong Nguyen and Thanh Le, for their guidance and support.

Citation Information

BibTeX:

@article{ViVLM_Vista_2024,
  title={Vista},
  author={Bui, Hop Van and Ha, Hoang Huy and Tran, Oanh Ngoc and Phan, Phuc Van},
  year={2024},
  month={June},
  url={https://huggingface.co/Vi-VLM/Vista}
}