---
language: de
license: apache-2.0
datasets: uonlp/CulturaX
---

# mistral7b-de-tokenizer-swap-pure-bf16-v2-anneal-ablation

Mistral-7B-v0.1 adapted to German as part of our study on efficient language adaptation: "Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough".

Code: https://github.com/konstantinjdobler/tight-budget-llm-adaptation

Paper: https://openreview.net/forum?id=VYfJaHeVod

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("konstantindobler/mistral7b-de-tokenizer-swap-pure-bf16-v2-anneal-ablation")
model = AutoModelForCausalLM.from_pretrained(
    "konstantindobler/mistral7b-de-tokenizer-swap-pure-bf16-v2-anneal-ablation",
    torch_dtype=torch.bfloat16,  # the checkpoint was trained in pure bfloat16
)

# Use model and tokenizer as usual
```
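
A minimal generation example (the German prompt is just an illustration):

```python
inputs = tokenizer("Die Hauptstadt von Deutschland ist", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that because of the tokenizer swap, the model is not compatible with the original Mistral-7B tokenizer; always load the tokenizer from this repository.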

## Details

The model is based on [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and was adapted to German.
The original tokenizer was replaced by a language-specific German tokenizer with a vocabulary of 32768 tokens. The new embeddings were initialized with [FOCUS](https://github.com/konstantinjdobler/focus).
The model was then trained on 8 billion German tokens from [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) in pure bfloat16 precision (no mixed precision). However, during the final annealing phase of the learning rate schedule, the model was instead trained with bfloat16 mixed precision. More details and hyperparameters can be found [in the paper](https://openreview.net/forum?id=VYfJaHeVod). Both ingredients are sketched below.
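
To illustrate the tokenizer swap, here is a minimal sketch that resizes the embedding matrices for the new vocabulary and copies over the embeddings of tokens present in both vocabularies. This shows only the simple overlap-copy step; FOCUS additionally initializes the non-overlapping tokens, and the released checkpoint already ships with the swapped tokenizer and initialized embeddings, so none of this is needed to use the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

source_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
target_tok = AutoTokenizer.from_pretrained("konstantindobler/mistral7b-de-tokenizer-swap-pure-bf16-v2-anneal-ablation")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16)

# Mistral-7B has untied input and output embeddings, so keep copies of both.
old_in = model.get_input_embeddings().weight.detach().clone()
old_out = model.get_output_embeddings().weight.detach().clone()

# Resize both embedding matrices to the new vocabulary size (here 32768).
model.resize_token_embeddings(len(target_tok))

source_vocab = source_tok.get_vocab()
with torch.no_grad():
    for token, new_id in target_tok.get_vocab().items():
        old_id = source_vocab.get(token)
        if old_id is not None:  # token exists in both vocabularies: copy it
            model.get_input_embeddings().weight[new_id] = old_in[old_id]
            model.get_output_embeddings().weight[new_id] = old_out[old_id]
# Non-overlapping tokens are left at the resize default in this sketch;
# FOCUS instead initializes them from semantically similar overlapping tokens.
```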
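
To make the precision distinction concrete, here is a schematic comparison of the two setups in plain PyTorch, continuing from the sketch above and ignoring distributed-training details (the actual training code lives in the linked repository; a CUDA device is assumed):

```python
batch = target_tok("Ein Beispielsatz auf Deutsch.", return_tensors="pt").to("cuda")
batch["labels"] = batch["input_ids"].clone()

# Pure bfloat16 (main training run): parameters, gradients, and optimizer
# states are all bfloat16; there is no fp32 master copy of the weights.
model = model.to("cuda", torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

# bfloat16 mixed precision (annealing ablation): parameters stay in fp32;
# only the forward/backward compute is autocast to bfloat16.
model.zero_grad(set_to_none=True)
model = model.to(torch.float32)
optimizer = torch.optim.AdamW(model.parameters())
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(**batch).loss
loss.backward()
optimizer.step()
```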

## Disclaimer

The web-scale dataset used for pretraining and tokenizer training ([uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX)) might contain personal and sensitive information.
This risk needs to be assessed carefully before any real-world deployment of the model.

## Citation

Please cite as follows:

```bibtex
@inproceedings{dobler2024language,
    title={Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough},
    author={Konstantin Dobler and Gerard de Melo},
    booktitle={2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)},
    year={2024},
    url={https://openreview.net/forum?id=VYfJaHeVod}
}
```