---
license: apache-2.0
language:
- en
- es
- de
- fr
- it
pipeline_tag: text-generation
---

![image/png](https://huggingface.co/datasets/malteos/images/resolve/main/occiglot.medium.png)

# Occiglot-7B-EU5

> A [polyglot](https://en.wikipedia.org/wiki/Multilingualism#In_individuals) language model for the [Occident](https://en.wikipedia.org/wiki/Occident).
>

**Occiglot-7B-EU5** is a generative language model with 7B parameters supporting the top-5 EU languages (English, Spanish, French, German, and Italian), trained by the [German Research Center for Artificial Intelligence (DFKI)](https://www.dfki.de/en/web). It is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and trained on 293B additional tokens of multilingual and code data with a block size of 8,192 tokens per sample. Note that the model is a general-purpose base model; it was not instruction-fine-tuned or optimized for chat or other applications.

This is the first release of an ongoing open research project for multilingual language models. If you want to train a model for your own language or are working on evaluations, please contact us. **We are open to collaborations!**

### Model details

- **Continued-pretraining from:** [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Model type:** Causal decoder-only transformer language model
- **Languages:** English, Spanish, French, German, Italian, and code.
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
- **Compute resources:** [HessianAI's 42](https://hessian.ai/)
- **Contributors:** Manuel Brack, Patrick Schramowski, Pedro Ortiz, Malte Ostendorff, Fabio Barth, Georg Rehm, Kristian Kersting
- **Research labs:** [SAINT](https://www.dfki.de/en/web/research/research-departments/foundations-of-systems-ai) and [SLT](https://www.dfki.de/en/web/research/research-departments/speech-and-language-technology)

### How to use

You can use this model directly with a pipeline for text generation. Since generation relies on some randomness, we set a seed for reproducibility:

```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='occiglot/occiglot-7b-eu5')
>>> set_seed(42)
>>> generator("Hallo, Ich bin ein Sprachmodell,", max_length=40, num_return_sequences=1)
[{'generated_text': 'Hallo, Ich bin ein Sprachmodell, das dir bei der Übersetzung von Texten zwischen Deutsch und Englisch helfen kann. Wenn du mir einen Text in Deutsch'}]
```

## Dataset

The training data was split amongst the four target languages (de, es, fr, it), with the remaining share used for continued training on English and code. The estimated data distribution by language is as follows:

- English: ~13%
- Code: ~5%
- German: ~20%
- Spanish: ~20%
- French: ~20%
- Italian: ~20%

The training data was prepared using [lm-datasets](https://github.com/malteos/lm-datasets). The exact data configuration is available [here](https://huggingface.co/occiglot/occiglot-7b-eu5/blob/main/lm-datasets-config.yml).

## Training settings

- Continual pre-training on 128 A100-80GB GPUs on [HessianAI's 42](https://hessian.ai/).
- Framework: [Determined](https://www.determined.ai/)
- Precision: bf16
- Optimizer: AdamW (lr: 0.00001, warmup_steps: 420)
- Global batch size: 512 (with a block size of 8,192) split over 128 GPUs
- Cosine annealing with warmup (see the sketch below)
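The actual run used the Determined framework on 128 GPUs; the following is only a minimal, illustrative sketch of the stated optimizer and schedule using standard PyTorch and `transformers` utilities. The total number of training steps is not given in this card, so the value below is a placeholder.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# A tiny stand-in module; the real run continually pre-trains the 7B-parameter model.
model = torch.nn.Linear(16, 16)

# Hyperparameters from the "Training settings" list: AdamW, lr 1e-5, 420 warmup steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# num_training_steps is NOT stated in this card; 10_000 is an assumed placeholder.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=420, num_training_steps=10_000
)

for step in range(5):  # a few dummy steps just to show the update order
    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    print(step, scheduler.get_last_lr())
```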
## Tokenizer

The tokenizer is unchanged from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).

## Evaluation

Preliminary evaluation results can be found below. Please note that the non-English results are based on partially machine-translated datasets and English prompts ([Belebele](https://huggingface.co/datasets/facebook/belebele) and the [Okapi framework](https://github.com/nlp-uoregon/Okapi)) and should therefore be interpreted with caution, e.g., they may be biased towards English model performance. We are currently working on more suitable benchmarks for Spanish, French, German, and Italian.

### All languages

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          | 0.5277            | 0.6825        | 0.7687       | 0.6287   | 0.6519  |
| leo-mistral-hessianai-7b | 0.4614            | 0.6423        | 0.6524       | 0.5440   | 0.5750  |
| Occiglot-7B-EU5          | 0.5083            | 0.7191        | 0.6758       | 0.5432   | 0.6116  |

### English

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          | 0.6143            | 0.8344        | 0.8444       | 0.6351   | 0.7321  |
| leo-mistral-hessianai-7b | 0.5213            | 0.7779        | 0.7356       | 0.5508   | 0.6464  |
| Occiglot-7B-EU5          | 0.5307            | 0.7900        | 0.7267       | 0.5467   | 0.6485  |

### German

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          | 0.4765            | 0.6101        | 0.7411       | 0.5274   | 0.5888  |
| leo-mistral-hessianai-7b | 0.4739            | 0.6818        | 0.6900       | 0.4887   | 0.5836  |
| Occiglot-7B-EU5          | 0.4944            | 0.6667        | 0.6467       | 0.4833   | 0.5728  |

### Spanish

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          | 0.5256            | 0.6728        | 0.7478       | 0.5432   | 0.6224  |
| leo-mistral-hessianai-7b | 0.4436            | 0.5970        | 0.6178       | 0.4359   | 0.5236  |
| Occiglot-7B-EU5          | 0.5085            | 0.7255        | 0.6778       | 0.4997   | 0.6029  |

### French

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          | 0.5244            | 0.6651        | 0.7744       | 0.5413   | 0.6263  |
| leo-mistral-hessianai-7b | 0.4354            | 0.5967        | 0.6222       | 0.4326   | 0.5217  |
| Occiglot-7B-EU5          | 0.5064            | 0.7125        | 0.6756       | 0.4959   | 0.5976  |

### Italian

| **model_name**           | **arc_challenge** | **hellaswag** | **belebele** | **mmlu** | **avg** |
|--------------------------|-------------------|---------------|--------------|----------|---------|
| Mistral-7B-v0.1          | 0.4979            | 0.6303        | 0.7356       | 0.5372   | 0.6002  |
| leo-mistral-hessianai-7b | 0.4328            | 0.5580        | 0.5967       | 0.4311   | 0.5047  |
| Occiglot-7B-EU5          | 0.5013            | 0.7008        | 0.6522       | 0.4949   | 0.5873  |
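The **avg** column in the tables above appears to be the unweighted mean of the four benchmark scores in each row. The short check below (an illustrative sketch, not part of the evaluation code) reproduces it for the Occiglot-7B-EU5 row of the all-languages table:

```python
# Scores copied from the "All languages" table for Occiglot-7B-EU5.
scores = {"arc_challenge": 0.5083, "hellaswag": 0.7191, "belebele": 0.6758, "mmlu": 0.5432}

# The avg column matches the unweighted mean of the four task scores.
avg = sum(scores.values()) / len(scores)
print(round(avg, 4))  # 0.6116, as reported in the table
```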
## Acknowledgements

The model training was supported by a compute grant at the [42 supercomputer](https://hessian.ai/), which is a central component in the development of [hessian AI](https://hessian.ai/), the [AI Innovation Lab](https://hessian.ai/infrastructure/ai-innovationlab/) (funded by the [Hessian Ministry of Higher Education, Research and the Arts (HMWK)](https://wissenschaft.hessen.de) & the [Hessian Ministry of the Interior, for Security and Homeland Security (HMinD)](https://innen.hessen.de)) and the [AI Service Centers](https://hessian.ai/infrastructure/ai-service-centre/) (funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html)). The curation of the training data is partially funded by the [German Federal Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/Navigation/EN/Home/home.html) through the project [OpenGPT-X](https://opengpt-x.de/en/) (project no. 68GX21007D).

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)