---
language:
- de
license: bigscience-bloom-rail-1.0
library_name: transformers
tags:
- ggml
- bloom
datasets:
- oscar
pipeline_tag: text-generation
---

# BLOOM-CLP German (6.4B parameters)

This is a monolingual German language model trained with the [CLP-Transfer](https://arxiv.org/abs/2301.09626) method, using [BLOOM-7b1](https://huggingface.co/bigscience/bloom-7b1) as the source model.
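
The core idea of CLP-Transfer is, roughly, to reuse the transformer weights of the source model and to initialize the token embeddings of the German vocabulary from BLOOM-7b1, falling back to a smaller German helper model for tokens that the source vocabulary does not cover. The snippet below is a rough, simplified sketch of that embedding-initialization idea (illustration only, not the authors' actual code); all variable names are hypothetical.

```python
# Simplified sketch of the CLP-Transfer embedding initialization idea
# (illustration only; see the linked paper for the actual procedure).
import numpy as np

def clp_init_embeddings(source_emb, source_vocab, helper_emb, target_vocab):
    """Build input embeddings for the target-language (German) model.

    source_emb:   token embeddings of the large source model (e.g. BLOOM-7b1)
    source_vocab: token -> row index into source_emb
    helper_emb:   token embeddings of a small model that already uses the
                  target-language tokenizer
    target_vocab: token -> row index into helper_emb (and into the new model)
    """
    target_emb = np.zeros((len(target_vocab), source_emb.shape[1]), dtype=source_emb.dtype)

    # Tokens shared by both vocabularies keep their source-model embeddings.
    overlap = [t for t in target_vocab if t in source_vocab]
    tgt_idx = np.array([target_vocab[t] for t in overlap])
    src_idx = np.array([source_vocab[t] for t in overlap])
    target_emb[tgt_idx] = source_emb[src_idx]

    # Target-only tokens: similarity-weighted combination of the overlapping
    # tokens' source embeddings, with similarities measured in the small
    # helper model's embedding space.
    helper_overlap = helper_emb[tgt_idx]
    helper_overlap = helper_overlap / np.linalg.norm(helper_overlap, axis=1, keepdims=True)
    for token, idx in target_vocab.items():
        if token in source_vocab:
            continue
        query = helper_emb[idx] / np.linalg.norm(helper_emb[idx])
        sims = np.clip(helper_overlap @ query, 0.0, None)  # non-negative cosine similarities
        weights = sims / sims.sum() if sims.sum() > 0 else np.full(len(sims), 1.0 / len(sims))
        target_emb[idx] = weights @ source_emb[src_idx]
    return target_emb
```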

You can try out the model at [European Language Grid](https://live.european-language-grid.eu/catalogue/tool-service/20825/try%20out/).

<span style="color:blue">UPDATE: We recently released an instruction-tuned version of this model: [malteos/bloom-6b4-clp-german-oasst-v0.1](https://huggingface.co/malteos/bloom-6b4-clp-german-oasst-v0.1)</span>.

### How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='malteos/bloom-6b4-clp-german')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3)

[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
 {'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
 {'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"}]
```
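
For more control over generation (precision, device placement, sampling parameters), the model can also be loaded directly with `AutoTokenizer` and `AutoModelForCausalLM`. A minimal sketch, assuming a single GPU with enough memory for the 6.4B-parameter checkpoint in half precision; the German prompt is only an illustrative example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("malteos/bloom-6b4-clp-german")
model = AutoModelForCausalLM.from_pretrained(
    "malteos/bloom-6b4-clp-german",
    torch_dtype=torch.float16,  # half precision to reduce memory usage
).to("cuda")

# Illustrative German prompt; any German text works here.
prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample a continuation; adjust max_new_tokens / temperature as needed.
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```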
## Training dataset

- Approx. 50B German tokens
- Web-crawled content from the German subset of [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult; a filtering sketch follows this list)
- Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
- Both web-crawled datasets were deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
- German court decisions from [Open Legal Data](http://openlegaldata.io/)
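
To illustrate the tag-based exclusion mentioned in the OSCAR item above, the following is a hypothetical filtering sketch. The record layout (a `text` field plus a `quality_warnings` list) is an assumption made for illustration and does not exactly match the OSCAR v22.01 metadata schema:

```python
# Hypothetical sketch of the tag-based filtering applied to the OSCAR subset.
# Field names are illustrative assumptions, not the real OSCAR v22.01 layout.
EXCLUDED_TAGS = {"header", "footer", "noisy", "adult"}

def keep_document(doc: dict) -> bool:
    """Return True if the crawled document carries none of the excluded tags."""
    tags = set(doc.get("quality_warnings") or [])
    return not (tags & EXCLUDED_TAGS)

docs = [
    {"text": "Impressum, Datenschutz, Kontakt", "quality_warnings": ["footer"]},
    {"text": "Ein sauberer deutscher Absatz.", "quality_warnings": []},
]
filtered = [d for d in docs if keep_document(d)]  # keeps only the second document
```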

## Code

- [BigScience's Megatron-Deepspeed fork](https://github.com/bigscience-workshop/Megatron-DeepSpeed)

## Hardware

- 32x A100-40GB GPUs
- 12.5 days
- [Tensorboard logs](https://huggingface.co/malteos/bloom-6b4-clp-german-logs/tensorboard)
## Evaluation

Validation perplexity (PPL) compared to training from scratch (lower is better):

<img alt="Tokens vs PPL" src="https://github.com/malteos/clp-transfer/raw/main/german-6b-ppl.png">

Additional evaluations can be found in [our paper](https://arxiv.org/abs/2301.09626).
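
For reference, the validation PPL shown above is the exponentiated mean token-level cross-entropy. A minimal sketch of computing such a value with this model on a short piece of held-out German text (the text and the single-batch handling are only illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("malteos/bloom-6b4-clp-german")
model = AutoModelForCausalLM.from_pretrained(
    "malteos/bloom-6b4-clp-german", torch_dtype=torch.float16
).to("cuda")
model.eval()

# Illustrative held-out sentence; a real evaluation iterates over the
# validation split in fixed-length windows.
text = "Die Würde des Menschen ist unantastbar."
enc = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    # With labels=input_ids the model returns the mean next-token
    # cross-entropy loss; perplexity is its exponential.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"PPL: {torch.exp(loss).item():.2f}")
```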

## How to cite

If you are using our code or models, please cite [our paper](https://arxiv.org/abs/2301.09626):
```bibtex
@misc{Ostendorff2023clp,
  doi = {10.48550/ARXIV.2301.09626},
  author = {Ostendorff, Malte and Rehm, Georg},
  title = {Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning},
  publisher = {arXiv},
  year = {2023}
}
```

## License

[BigScience BLOOM RAIL 1.0](https://bigscience.huggingface.co/blog/the-bigscience-rail-license)