---
library_name: transformers
tags: []
---
# Model Card for LOLA v1

## Model Details

### Model Description
- Developed by: [DICE Research Group](https://dice-research.org/) @ [Paderborn University](https://www.uni-paderborn.de/)
- Model type: GPT-2-style (decoder-only) Transformer with Mixture-of-Experts layers
- Language(s) (NLP): 160+
- License: Coming soon
- Repository: https://github.com/dice-group/LOLA-Megatron-DeepSpeed
## How to Get Started with the Model

This pretrained (causal language modeling) model can be used out of the box only for text generation; for downstream tasks it requires further fine-tuning.

### How to use

You can use this model directly with a pipeline for text generation:
```python
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model="dice-research/lola_v1", trust_remote_code=True)
>>> generator("The quick brown fox", max_length=13)
[{'generated_text': 'The quick brown fox jumps over the lazy dog.'}]
```
To use top-k sampling, set `do_sample` to `True`.
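For example, a minimal sampling call (the `top_k` and `max_length` values below are illustrative, not tuned recommendations):

```python
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model="dice-research/lola_v1", trust_remote_code=True)
>>> # do_sample=True switches from greedy decoding to sampling;
>>> # top_k=50 restricts each step to the 50 most likely next tokens (illustrative value).
>>> generator("The quick brown fox", max_length=30, do_sample=True, top_k=50)
```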
Note: The tokenizer used in this model comes from [mGPT](https://github.com/ai-forever/mgpt).
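If you need more control than the pipeline offers, the model and tokenizer can also be loaded directly. A minimal sketch, assuming the Hub repository works with the standard Auto classes (the custom MoE architecture again requires `trust_remote_code=True`; generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dice-research/lola_v1")
model = AutoModelForCausalLM.from_pretrained("dice-research/lola_v1", trust_remote_code=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```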
## Training Details

### Training Framework

- [DeepSpeed Megatron](https://github.com/microsoft/Megatron-DeepSpeed)
- Architecture type: Transformers (Decoder-only) with Mixture-of-Experts (MoE)
- Number of Experts: 16
- Model Size: 1.3 billion dense / 7.4 billion sparse parameters (see the routing sketch below)
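The gap between the dense and sparse sizes comes from expert routing: each token is processed by only a small subset of the 16 experts in a given MoE layer, so the parameters active per token stay close to the 1.3-billion dense size while the total parameter count reaches 7.4 billion. The toy sketch below shows top-1 routing purely for illustration; it is not the Megatron-DeepSpeed implementation, and the hidden size and routing details are made up.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy top-1 Mixture-of-Experts feed-forward layer (illustration only)."""

    def __init__(self, hidden_size: int = 64, num_experts: int = 16):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)  # router: one score per expert
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size); each token is routed to exactly one expert,
        # so only 1/num_experts of the expert parameters are active per token.
        expert_idx = self.gate(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

print(ToyMoELayer()(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```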
### Pretraining Dataset

- [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) (see the loading sketch below)
- Total Tokens: 6.3 Trillion
- Total Languages: 167
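As a sketch of how the corpus can be inspected, assuming the standard `datasets` streaming API and CulturaX's per-language configurations (access may require accepting the dataset's terms on the Hugging Face Hub):

```python
from datasets import load_dataset

# Stream one language subset of CulturaX instead of downloading it in full;
# the "en" configuration name follows the dataset card and is assumed here.
culturax_en = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)

for example in culturax_en.take(3):
    print(example["text"][:100])
```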
### LOLA v1 Training

- Computing cluster: [Noctua2](https://pc2.uni-paderborn.de/hpc-services/available-systems/noctua2)
- Number of GPUs: 96x NVIDIA A100 (40GB)
- Training steps: 296,000
- Tokens consumed: 465 Billion
- Training time: ~19 days