---
library_name: transformers
tags: []
---
# Model Card for LOLA v1
## Model Details
### Model Description
- **Developed by:** DICE Research Group (https://dice-research.org/) @ Paderborn University (https://www.uni-paderborn.de/)
- **Model type:** GPT2 style (decoder-only) with Mixture-of-Experts layers
- **Language(s) (NLP):** 160+
- **License:** Coming soon
- **Repository:** https://github.com/dice-group/LOLA-Megatron-DeepSpeed
## How to Get Started with the Model
This model is pre-trained with a causal language modeling objective. As released, it is suitable only for text generation; other downstream tasks require further fine-tuning.
### How to use
You can use this model directly with a pipeline for text generation.
```python
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model="dice-research/lola_v1", trust_remote_code=True)
>>> generator("The quick brown fox", max_length=13)
[{'generated_text': 'The quick brown fox jumps over the lazy dog.'}]
```
To use top-k sampling, set `do_sample` to `True`.
**Note:** The model uses the tokenizer from mGPT (https://github.com/ai-forever/mgpt).
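
As a minimal sketch of sampling with the pipeline, the helper below forwards `do_sample=True` together with a `top_k` value to the generator. The helper names `sample_top_k` and `build_generator`, the default values `top_k=50` and `max_new_tokens=20`, and the fixed seed are assumptions chosen for illustration, not settings recommended by the model authors.

```python
from typing import Any, Callable, Dict, List


def sample_top_k(
    generator: Callable[..., List[Dict[str, Any]]],
    prompt: str,
    top_k: int = 50,           # illustrative default, not a tuned value
    max_new_tokens: int = 20,  # illustrative default, not a tuned value
) -> str:
    """Return one continuation of `prompt` drawn with top-k sampling."""
    outputs = generator(
        prompt,
        do_sample=True,  # switch from deterministic decoding to sampling
        top_k=top_k,     # sample only among the k most likely next tokens
        max_new_tokens=max_new_tokens,
    )
    # The text-generation pipeline returns a list of dicts, one per sequence.
    return outputs[0]["generated_text"]


def build_generator():
    """Load the same pipeline as in the snippet above (downloads the model)."""
    from transformers import pipeline, set_seed

    set_seed(42)  # optional: makes the sampled output repeatable
    return pipeline(
        "text-generation", model="dice-research/lola_v1", trust_remote_code=True
    )
```

Calling `sample_top_k(build_generator(), "The quick brown fox")` returns a single sampled string; because the output is sampled, it differs from run to run unless a seed is fixed.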
## Training Details
### Training Framework
- DeepSpeed Megatron (https://github.com/microsoft/Megatron-DeepSpeed)
- Architecture type: Transformers (Decoder-only) with Mixture-of-Experts (MoE)
- Number of Experts: 16
- Model Size: 1.3 billion parameters (dense) / 7.4 billion parameters (sparse)
### Pretraining Dataset
- CulturaX (https://huggingface.co/datasets/uonlp/CulturaX)
- Total Tokens: 6.3 Trillion
- Total Languages: 167
### LOLA v1 Training
- Computing cluster: Noctua2 (https://pc2.uni-paderborn.de/hpc-services/available-systems/noctua2)
- Number of GPUs: 96x Nvidia A100 (40GB)
- Training steps: 296,000
- Tokens consumed: 465 Billion
- Training time: ~19 days
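
For a rough sense of scale, the figures above can be combined into a few derived quantities. This is back-of-the-envelope arithmetic only: it assumes the reported numbers are exact, that "~19 days" is uninterrupted wall-clock time, and that all 96 GPUs were in use throughout. None of the derived values are reported by the authors.

```python
# Figures as reported in this model card.
TRAINING_STEPS = 296_000
TOKENS_CONSUMED = 465e9   # 465 billion
DATASET_TOKENS = 6.3e12   # 6.3 trillion (CulturaX)
NUM_GPUS = 96
TRAINING_DAYS = 19        # "~19 days", treated here as exact (assumption)

# Average number of tokens processed per optimizer step.
tokens_per_step = TOKENS_CONSUMED / TRAINING_STEPS

# Share of the pretraining dataset seen during training.
fraction_of_dataset = TOKENS_CONSUMED / DATASET_TOKENS

# Total compute budget and average throughput.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
tokens_per_second = TOKENS_CONSUMED / (TRAINING_DAYS * 24 * 3600)
tokens_per_gpu_second = tokens_per_second / NUM_GPUS

print(f"tokens per step:       {tokens_per_step:,.0f}")
print(f"fraction of dataset:   {fraction_of_dataset:.2%}")
print(f"GPU-hours:             {gpu_hours:,}")
print(f"tokens/s (cluster):    {tokens_per_second:,.0f}")
print(f"tokens/s (per GPU):    {tokens_per_gpu_second:,.0f}")
```

Under these assumptions the run averages roughly 1.57 million tokens per step, covers about 7.4% of the dataset's tokens, and amounts to 43,776 A100 GPU-hours.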