---
library_name: transformers
license: cc-by-4.0
datasets:
- uonlp/CulturaX
---

# LOLA — An Open-Source Massively Multilingual Large Language Model

## Model Description

- **Developed by:** DICE Research Group (https://dice-research.org/) @ Paderborn University (https://www.uni-paderborn.de/)
- **Model type:** GPT-2 style (decoder-only) with alternating sparse Mixture-of-Experts layers
- **Number of Experts:** 16
- **Model Size:** 1.3 Billion (active*) / 7.4 Billion (total)
- **Language(s) (NLP):** 160+
- **License:** CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
- **Repository:** https://github.com/dice-group/LOLA

\* The number of parameters the model utilizes per token (ref: [Du et al., 2022](https://arxiv.org/abs/2112.06905)). This distinction is crucial for understanding the efficiency and performance of MoE models.

## How to Get Started with the Model

This pre-trained (causal language modeling) model can only be used for text generation and requires further fine-tuning on downstream tasks.

### How to use

You can use this model directly with a pipeline for text generation.

```python
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model="dice-research/lola_v1", trust_remote_code=True)
>>> generator("The quick brown fox", max_length=13)
[{'generated_text': 'The quick brown fox jumps over the lazy dog.'}]
```

To use top-k sampling, set `do_sample` to `True`; see the sampling sketch at the end of this card.

**Note:** The tokenizer used in the model comes from mGPT (https://github.com/ai-forever/mgpt).

## Training Details

### Training Framework

- Megatron-DeepSpeed (https://github.com/microsoft/Megatron-DeepSpeed)
- Architecture type: Transformer (decoder-only) with Mixture-of-Experts (MoE)
- Number of Experts: 16
- Model Size: 1.3 Billion Dense / 7.4 Billion Sparse

### Pretraining Dataset

- CulturaX (https://huggingface.co/datasets/uonlp/CulturaX)
- Total Tokens: 6.3 Trillion
- Total Languages: 167

### LOLA v1 Training

- Computing cluster: Noctua2 (https://pc2.uni-paderborn.de/hpc-services/available-systems/noctua2)
- Number of GPUs: 96x Nvidia A100 (40GB)
- Training steps: 296,000
- Tokens consumed: 465 Billion
- Training time: ~19 days

## Citation

If you use our work in your research, please make sure to cite it:

```bibtex
@misc{srivastava2024lolaopensourcemassively,
  title={LOLA -- An Open-Source Massively Multilingual Large Language Model},
  author={Nikit Srivastava and Denis Kuchelev and Tatiana Moteu Ngoli and Kshitij Shetty and Michael Roeder and Diego Moussallem and Hamada Zahera and Axel-Cyrille Ngonga Ngomo},
  year={2024},
  eprint={2409.11272},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.11272},
}
```
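
As a complement to the sampling note in the "How to use" section, here is a minimal sketch of top-k sampling through the same pipeline. It assumes the `dice-research/lola_v1` checkpoint shown above; the `top_k=50` and `max_length=30` values are illustrative choices, not settings recommended by the LOLA authors.

```python
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model="dice-research/lola_v1", trust_remote_code=True)
>>> # do_sample=True switches from greedy decoding to sampling;
>>> # top_k=50 restricts each step to the 50 most likely tokens (illustrative value).
>>> generator("The quick brown fox", do_sample=True, top_k=50, max_length=30)
```

Since sampling is stochastic, the generated text will differ between runs.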