Maya: A Multilingual Vision Language Model
Maya is an instruction-finetuned multilingual multimodal model that extends multimodal capabilities to eight languages, with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya is trained on a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.
Model Description
- Developed by: Cohere For AI Community
- Model type: Multimodal Vision-Language Model
- Language(s): English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi
- License: Apache 2.0
- Related Paper: Maya: An Instruction Finetuned Multilingual Multimodal Model
Model Details
Maya uses a lightweight architecture to provide a compact yet capable multimodal model. Key features:
- Built on the LLaVA framework with Aya-23 8B as the language backbone
- Uses SigLIP for vision encoding, chosen for its multilingual adaptability
- Supports eight languages with attention to cultural context
- Trained on a toxicity-filtered dataset for safer deployment
Model Architecture
- Base Model: Aya-23 8B
- Vision Encoder: SigLIP (multilingual)
- Training Data: 558,000 images with multilingual annotations
- Context Length: 8K tokens
- Parameters: 8 billion
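Conceptually, the LLaVA-style design connects these pieces with a small projector that maps vision-encoder patch features into the language model's token-embedding space. The PyTorch sketch below illustrates that connector only; the two-layer MLP shape, the feature widths, and the patch count are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Illustrative two-layer MLP mapping SigLIP patch features into the
    language model's embedding space (the LLaVA-style connector)."""

    def __init__(self, vision_dim=1152, llm_dim=4096):
        # 1152: typical SigLIP (so400m) feature width; 4096: assumed
        # Aya-23 8B hidden size -- both illustrative, not verified values.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from the vision
        # encoder; the output is spliced into the text-token embedding
        # sequence consumed by the Aya-23 decoder.
        return self.net(patch_features)

# Example: project 729 patch tokens for one image (illustrative count)
image_tokens = MultimodalProjector()(torch.randn(1, 729, 1152))
print(image_tokens.shape)  # torch.Size([1, 729, 4096])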
Intended Uses
Maya is designed for:
- Multilingual visual question answering
- Cross-cultural image understanding
- Image captioning in multiple languages
- Visual reasoning tasks
- Document understanding
Usage
# Clone the GitHub repository
git clone https://github.com/nahidalam/maya
# Change the working directory
cd maya

# Then, from inside the repository, run the following Python code
from llava.eval.talk2mayav2 import run_vqa_model

# Define the question and the input image
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run the model and print its answer
answer = run_vqa_model(
    question=question,
    image_file=image_path,
)
print(answer)
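Since Maya is instruction-tuned in all eight languages, the same entry point accepts non-English questions. A minimal sketch reusing the image above (the Hindi prompt, which translates to "What is shown in this image?", is an illustrative example, not from the repository):

# Query the model in Hindi (illustrative prompt)
question_hi = "इस तस्वीर में क्या दिखाया गया है?"  # "What is shown in this image?"
answer_hi = run_vqa_model(
    question=question_hi,
    image_file=image_path,
)
print(answer_hi)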
Model Performance
Performance across the eight supported languages:
| Model | English | Chinese | French | Spanish | Russian | Japanese | Arabic | Hindi | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Maya (8B) | 61.5 | 61.7 | 61.0 | 60.4 | 62.2 | 63.7 | 63.4 | 64.0 | 60.4 |
Limitations
- Limited to 8 languages currently
- Requires high-quality images for optimal performance
- May not capture nuanced cultural contexts in all cases
- Performance varies across languages and tasks
Bias, Risks, and Limitations
Maya has been developed with attention to bias mitigation and safety:
- Dataset filtered for toxic content
- Cultural sensitivity evaluations performed
- Regular bias assessments conducted
- Training restricted to high-quality, vetted data
However, users should be aware that:
- Model may still exhibit biases present in training data
- Performance may vary across different cultural contexts
- Not suitable for critical decision-making applications
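As a rough illustration of the dataset filtering described above, a caption-level filter might look like the sketch below; score_toxicity is a hypothetical stand-in for whatever classifier is used, and the 0.5 threshold is an assumption, not the project's actual pipeline.

# Hypothetical caption-level toxicity filter (illustrative only)
def filter_pairs(pairs, score_toxicity, threshold=0.5):
    """Keep only (image_path, caption) pairs whose caption scores below
    the toxicity threshold; score_toxicity is any classifier returning
    a probability in [0, 1]."""
    return [
        (image, caption)
        for image, caption in pairs
        if score_toxicity(caption) < threshold
    ]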
Training Details
Maya was trained using:
- 558,000 curated images
- Multilingual annotations in 8 languages
- Toxicity-filtered dataset
- 8×NVIDIA H100 GPUs (80 GB each)
- Per-device batch size of 32
- Learning rate of 1e-3 with a cosine scheduler
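For reference, the reported hyperparameters map onto a Hugging Face TrainingArguments-style configuration roughly as follows; fields beyond the batch size, learning rate, and scheduler (epochs, warmup, precision) are assumptions, not reported values.

from transformers import TrainingArguments

# Sketch of a config matching the reported hyperparameters
training_args = TrainingArguments(
    output_dir="./checkpoints/maya",
    per_device_train_batch_size=32,  # reported: batch size 32 per device
    learning_rate=1e-3,              # reported learning rate
    lr_scheduler_type="cosine",      # reported cosine scheduler
    num_train_epochs=1,              # assumption
    warmup_ratio=0.03,               # assumption
    bf16=True,                       # assumption for H100 training
)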
Citation
@article{alam2024maya,
  title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
  author={Alam, Nahid and Kanjula, Karthik Reddy and Guthikonda, Surya and Chung, Timothy and Vegesna, Bala Krishna S and Das, Abhipsha and Susevski, Anthony and Chan, Ryan Sze-Yin and Uddin, S M Iftekhar and Islam, Shayekh Bin and others},
  journal={arXiv preprint arXiv:placeholder},
  year={2024}
}
Contact
For questions or feedback about Maya, please:
- Open an issue on our GitHub repository
- Contact the maintainers at: [email protected]