Maya: A Multilingual Vision Language Model
Maya is an instruction-finetuned multilingual multimodal model that extends multimodal capabilities to eight languages, with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya is trained on a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.
Model Description
- Developed by: Cohere For AI Community
- Model type: Multimodal Vision-Language Model
- Language(s): English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi
- License: Apache 2.0
- Related Paper: Maya: An Instruction Finetuned Multilingual Multimodal Model
Model Details
Maya uses a lightweight architecture to provide a compact yet capable multimodal model. Key features:
- Built on the LLaVA framework with Aya-23 8B as the language backbone
- Uses SigLIP for vision encoding, chosen for its multilingual adaptability
- Supports eight languages with attention to cultural context
- Trained on a toxicity-filtered dataset for safer deployment
Model Architecture
- Base Model: Aya-23 8B
- Vision Encoder: SigLIP (multilingual)
- Training Data: 558,000 images with multilingual annotations
- Context Length: 8K tokens
- Parameters: 8 billion
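Conceptually, the LLaVA-style design connects these pieces with a small projector that maps vision-encoder patch features into the language model's token-embedding space. The PyTorch sketch below illustrates that connector only; the two-layer MLP shape, the feature widths, and the patch count are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Illustrative two-layer MLP mapping SigLIP patch features into the
    language model's embedding space (the LLaVA-style connector)."""

    def __init__(self, vision_dim=1152, llm_dim=4096):
        # 1152: typical SigLIP (so400m) feature width; 4096: assumed
        # Aya-23 8B hidden size -- both illustrative, not verified values.
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from the vision
        # encoder; the output is spliced into the text-token embedding
        # sequence consumed by the Aya-23 decoder.
        return self.net(patch_features)

# Example: project 729 patch tokens for one image (illustrative count)
image_tokens = MultimodalProjector()(torch.randn(1, 729, 1152))
print(image_tokens.shape)  # torch.Size([1, 729, 4096])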
Intended Uses
Maya is designed for:
- Multilingual visual question answering
- Cross-cultural image understanding
- Image captioning in multiple languages
- Visual reasoning tasks
- Document understanding
Usage
# Clone the GitHub repository
git clone https://github.com/nahidalam/maya
# Change the working directory
cd maya

# Then, from inside the repository, run the following Python code
from llava.eval.talk2mayav2 import run_vqa_model

# Define the question and the input image
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run the model and print its answer
answer = run_vqa_model(
    question=question,
    image_file=image_path,
)
print(answer)
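Since Maya is instruction-tuned in all eight languages, the same entry point accepts non-English questions. A minimal sketch reusing the image above (the Hindi prompt, which translates to "What is shown in this image?", is an illustrative example, not from the repository):

# Query the model in Hindi (illustrative prompt)
question_hi = "इस तस्वीर में क्या दिखाया गया है?"  # "What is shown in this image?"
answer_hi = run_vqa_model(
    question=question_hi,
    image_file=image_path,
)
print(answer_hi)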
Model Performance
Performance across the eight supported languages:
| Model | English | Chinese | French | Spanish | Russian | Japanese | Arabic | Hindi | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Maya (8B) | 61.5 | 61.7 | 61.0 | 60.4 | 62.2 | 63.7 | 63.4 | 64.0 | 60.4 |
Limitations
- Limited to 8 languages currently
- Requires high-quality images for optimal performance
- May not capture nuanced cultural contexts in all cases
- Performance varies across languages and tasks
Bias, Risks, and Limitations
Maya has been developed with attention to bias mitigation and safety:
- Dataset filtered for toxic content
- Cultural sensitivity evaluations performed
- Regular bias assessments conducted
- Training restricted to high-quality, vetted data
However, users should be aware that:
- Model may still exhibit biases present in training data
- Performance may vary across different cultural contexts
- Not suitable for critical decision-making applications
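As a rough illustration of the dataset filtering described above, a caption-level filter might look like the sketch below; score_toxicity is a hypothetical stand-in for whatever classifier is used, and the 0.5 threshold is an assumption, not the project's actual pipeline.

# Hypothetical caption-level toxicity filter (illustrative only)
def filter_pairs(pairs, score_toxicity, threshold=0.5):
    """Keep only (image_path, caption) pairs whose caption scores below
    the toxicity threshold; score_toxicity is any classifier returning
    a probability in [0, 1]."""
    return [
        (image, caption)
        for image, caption in pairs
        if score_toxicity(caption) < threshold
    ]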
Training Details
Maya was trained using:
- 558,000 curated images
- Multilingual annotations in 8 languages
- Toxicity-filtered dataset
- 8×NVIDIA H100 GPUs (80 GB each)
- Per-device batch size of 32
- Learning rate of 1e-3 with a cosine scheduler
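For reference, the reported hyperparameters map onto a Hugging Face TrainingArguments-style configuration roughly as follows; fields beyond the batch size, learning rate, and scheduler (epochs, warmup, precision) are assumptions, not reported values.

from transformers import TrainingArguments

# Sketch of a config matching the reported hyperparameters
training_args = TrainingArguments(
    output_dir="./checkpoints/maya",
    per_device_train_batch_size=32,  # reported: batch size 32 per device
    learning_rate=1e-3,              # reported learning rate
    lr_scheduler_type="cosine",      # reported cosine scheduler
    num_train_epochs=1,              # assumption
    warmup_ratio=0.03,               # assumption
    bf16=True,                       # assumption for H100 training
)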
Citation
@article{alam2024maya,
  title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
  author={Alam, Nahid and Kanjula, Karthik Reddy and Guthikonda, Surya and Chung, Timothy and Vegesna, Bala Krishna S and Das, Abhipsha and Susevski, Anthony and Chan, Ryan Sze-Yin and Uddin, S M Iftekhar and Islam, Shayekh Bin and others},
  journal={arXiv preprint arXiv:placeholder},
  year={2024}
}
Contact
For questions or feedback about Maya, please:
- Open an issue on our GitHub repository
- Contact the maintainers at: [email protected]