
Maya: A Multilingual Vision Language Model

Maya is an instruction-finetuned multilingual multimodal model that expands multimodal capabilities to eight languages with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya includes a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.

Model Description

Model Details

Maya leverages a lightweight LLaVA-style architecture to provide a compact yet capable multimodal model, with several key features:

  • Built on the LLaVA framework using the Aya-23 8B language model
  • Uses SigLIP for vision encoding with multilingual adaptability
  • Supports 8 languages with strong cultural understanding
  • Trained on toxicity-filtered dataset for safer deployment

Model Architecture

  • Base Model: Aya-23 8B
  • Vision Encoder: SigLIP (multilingual)
  • Training Data: 558,000 images with multilingual annotations
  • Context Length: 8K tokens
  • Parameters: 8 billion
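
The components above compose in the standard LLaVA pattern: the SigLIP encoder turns the image into patch features, a projector maps those features into the language model's embedding space, and Aya-23 8B decodes the answer. The sketch below only illustrates that data flow; the class and attribute names are hypothetical and do not correspond to Maya's actual implementation.

import torch
import torch.nn as nn

class LlavaStyleSketch(nn.Module):
    """Illustrative LLaVA-style wiring (hypothetical names, not Maya's code)."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP backbone
        self.projector = projector            # e.g. a small MLP
        self.language_model = language_model  # e.g. the Aya-23 8B decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # 1. Encode the image into patch-level features.
        image_features = self.vision_encoder(pixel_values)
        # 2. Project visual features into the LLM embedding space.
        image_embeds = self.projector(image_features)
        # 3. Prepend visual tokens to the text embeddings and decode.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)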

Intended Uses

Maya is designed for:

  • Multilingual visual question answering
  • Cross-cultural image understanding
  • Image captioning in multiple languages
  • Visual reasoning tasks
  • Document understanding

Usage

# Clone the GitHub repository
git clone https://github.com/nahidalam/maya

# Change into the repository directory
cd maya

Then, from the repository root, run the following Python code:

from llava.eval.talk2mayav2 import run_vqa_model

# Define inputs
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run the model and print the answer
answer = run_vqa_model(
    question=question,
    image_file=image_path,
)
print(answer)
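
Because Maya is multilingual, the same entry point accepts questions in any of the eight supported languages. A minimal follow-up example, assuming run_vqa_model returns the answer as a plain string (Hindi question shown; it translates to "Based on the design, tell which aircraft this is"):

# Ask the same question in Hindi (one of the eight supported languages)
question_hi = "डिज़ाइन के आधार पर बताइए कि यह कौन सा विमान है।"

answer_hi = run_vqa_model(
    question=question_hi,
    image_file=image_path,
)
print(answer_hi)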

Model Performance

Performance across key benchmarks:

Model       English  Chinese  French  Spanish  Russian  Japanese  Arabic  Hindi  Avg.
Maya (8B)   61.5     61.7     61.0    60.4     62.2     63.7      63.4    64.0   60.4

Limitations

  • Limited to 8 languages currently
  • Requires high-quality images for optimal performance
  • May not capture nuanced cultural contexts in all cases
  • Performance varies across languages and tasks

Bias, Risks, and Limitations

Maya has been developed with attention to bias mitigation and safety:

  • Dataset filtered for toxic content
  • Cultural sensitivity evaluations performed
  • Regular bias assessments conducted
  • Limited to high-quality, vetted training data

However, users should be aware that:

  • Model may still exhibit biases present in training data
  • Performance may vary across different cultural contexts
  • Not suitable for critical decision-making applications

Training Details

Maya was trained using:

  • 558,000 curated images
  • Multilingual annotations in 8 languages
  • Toxicity-filtered dataset
  • 8× NVIDIA H100 GPUs (80 GB memory each)
  • Batch size of 32 (per device)
  • Learning rate of 1e-3 with cosine scheduler
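
For convenience, the hyperparameters listed above can be gathered into a single configuration sketch (illustrative only; the dict and its field names are hypothetical, not Maya's actual training script):

# Pre-training setup from this card, collected into one dict for readability
# (illustrative; not the actual Maya launch configuration).
pretrain_config = {
    "num_images": 558_000,                  # curated, toxicity-filtered images
    "num_languages": 8,                     # multilingual annotations
    "hardware": "8 x NVIDIA H100 (80 GB)",  # GPUs used for training
    "per_device_train_batch_size": 32,
    "learning_rate": 1e-3,
    "lr_scheduler_type": "cosine",
}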

Citation

@article{alam2024maya,
  title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
  author={Alam, Nahid and Kanjula, Karthik Reddy and Guthikonda, Surya and Chung, Timothy and Vegesna, Bala Krishna S and Das, Abhipsha and Susevski, Anthony and Chan, Ryan Sze-Yin and Uddin, S M Iftekhar and Islam, Shayekh Bin and others},
  journal={arXiv preprint arXiv:placeholder},
  year={2024}
}

Contact

For questions or feedback about Maya, please open an issue on the GitHub repository: https://github.com/nahidalam/maya
