---
library_name: transformers
base_model: meta-llama/Llama-2-7b-hf
license: llama2
pipeline_tag: text-generation
language:
- multilingual
datasets:
- cis-lmu/Glot500
---
|
|
|
# MaLA-500: Massive Language Adaptation of Large Language Models
|
|
|
MaLA-500 is a large language model designed to cover an extensive range of 534 languages. It builds on LLaMA 2 7B and combines continued pretraining on massively multilingual data with vocabulary extension, enlarging the vocabulary to 260,164 tokens, and LoRA low-rank adaptation.
|
|
|
|
|
- **Continued Pretraining:** Further training of the base model on massively multilingual data adapts it to a wide range of languages.
- **LoRA Low-Rank Adaptation:** Low-rank adapter matrices are trained instead of the full set of base weights, keeping the adaptation lightweight (a schematic setup is sketched after this list).
- **Vocabulary Extension:** The vocabulary is extended to 260,164 tokens to better represent the added languages.
- **Multilingual Proficiency:** Trained on Glot500-c, covering 534 languages.
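
For intuition, here is a minimal sketch of how vocabulary extension and LoRA adaptation can be combined with `peft`. The rank, target modules, and other hyperparameters are illustrative assumptions, not the settings used to train MaLA-500; please refer to the paper for the actual configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative sketch only: the hyperparameters below are assumptions for demonstration,
# not the configuration used to train MaLA-500.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Vocabulary extension: enlarge the embedding matrix (and LM head) to the new vocabulary size.
base_model.resize_token_embeddings(260164)

# LoRA low-rank adaptation: train small rank-decomposition matrices on selected projections,
# while keeping the resized embeddings and LM head trainable.
lora_config = LoraConfig(
    r=8,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```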
|
|
|
Please refer to [our paper](https://arxiv.org/pdf/2401.13303v1.pdf) for more details.
|
|
|
## How to Get Started with the Model
|
|
|
Requirements:

```
transformers>=4.36.1
peft>=0.6.2
```
|
|
|
Use the code below to get started with the model.
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the LLaMA 2 7B base model and resize its embeddings to the extended vocabulary.
base_model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
base_model.resize_token_embeddings(260164)

# Load the MaLA-500 tokenizer and attach the LoRA adapter weights.
tokenizer = AutoTokenizer.from_pretrained('MaLA-LM/mala-500')
model = PeftModel.from_pretrained(base_model, 'MaLA-LM/mala-500')
```
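
Once the adapter is loaded, the model can be used like any other causal language model. The snippet below is a minimal generation sketch; the prompt and decoding settings are illustrative assumptions rather than recommendations.

```python
import torch

# Illustrative only: prompt and decoding settings are assumptions, not official recommendations.
prompt = "The history of the Swahili language"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```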
|
|
|
## Citation
|
|
|
```
@misc{lin2024mala500,
  title={MaLA-500: Massive Language Adaptation of Large Language Models},
  author={Peiqin Lin and Shaoxiong Ji and Jörg Tiedemann and André F. T. Martins and Hinrich Schütze},
  year={2024},
  eprint={2401.13303},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```