arxiv:2412.07112

Maya: An Instruction Finetuned Multilingual Multimodal Model

Published on Dec 10

Submitted by kkr5155 on Dec 10

Authors:

Karthik Reddy Kanjula, Surya Guthikonda, Bala Krishna S Vegesna, Ryan Sze-Yin Chan, Drishti Sharma, Isha Chaturvedi

Abstract

The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances in a manner free from toxicity. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.

Community

kkr5155 (paper author, paper submitter), 1 day ago

A new multimodal, multilingual vision-language model. Maya is fully open (open source, open weights, and open dataset) and is designed to handle 8 languages, cultural diversity, and nuanced real-world contexts in vision-language tasks.