RichardErkhov committed on
Commit
d885408
1 Parent(s): 8057b28

uploaded readme

  1. README.md +99 -0
README.md ADDED
@@ -0,0 +1,99 @@
1
+ Quantization made by Richard Erkhov.
2
+
3
+ [Github](https://github.com/RichardErkhov)
4
+
5
+ [Discord](https://discord.gg/pvy7H8DZMG)
6
+
7
+ [Request more models](https://github.com/RichardErkhov/quant_request)
8
+
9
+
10
+ gpt-neo-125M-dutch - bnb 4bits
11
+ - Model creator: https://huggingface.co/yhavinga/
12
+ - Original model: https://huggingface.co/yhavinga/gpt-neo-125M-dutch/
13
+
14
+
15
+
16
+
17
+ Original model description:
18
+ ---
19
+ language: nl
20
+ widget:
21
+ - text: "In het jaar 2030 zullen we"
22
+ - text: "Toen ik gisteren volledig in de ban was van"
23
+ - text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul"
24
+ - text: "In Israël was een strenge lockdown"
25
+ tags:
26
+ - gpt2-medium
27
+ - gpt2
28
+ pipeline_tag: text-generation
29
+ datasets:
30
+ - yhavinga/mc4_nl_cleaned
31
+ ---
32
+ # GPT-Neo 125M pre-trained on cleaned Dutch mC4 🇳🇱
33
+
34
+ A small GPT-Neo model (125M parameters) trained from scratch on Dutch, with a perplexity of 20.9 on cleaned Dutch mC4.
35
+
36
+ ## How To Use
37
+
38
+ You can use this GPT-Neo model directly with a pipeline for text generation.
39
+
40
+ ```python
41
+ MODEL_DIR='yhavinga/gpt-neo-125M-dutch'
42
+ from transformers import pipeline, GPT2Tokenizer, GPTNeoForCausalLM
43
+ tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
44
+ model = GPTNeoForCausalLM.from_pretrained(MODEL_DIR)
45
+ generator = pipeline('text-generation', model, tokenizer=tokenizer)
46
+
47
+ generated_text = generator('Wetenschappers verbonden aan de Katholieke Universiteit', max_length=256, do_sample=True, top_k=50, top_p=0.95, temperature=0.7, no_repeat_ngram_size=2)
48
+ ```
49
+
50
+ *"Wetenschappers verbonden aan de Katholieke Universiteit van Nijmegen" - "hebben ontdekt dat de genen die een mens heeft, een enorme invloed hebben op het DNA van zijn lichaam.
51
+ Cellen kunnen zich beter binden aan het DNA dan andere soorten cellen. De genen die de cellen maken, zijn bepalend voor de groei van de cel.
52
+ Het DNA van een mens is niet alleen informatiedrager, maar ook een bouwstof voor het DNA. Het wordt gevonden in de genen van een cel. Als er op een cel een cel"*
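
The `temperature`, `top_k` and `top_p` arguments in the call above control how the next token is drawn. As a rough illustration only (this is a simplified pure-Python sketch, not the `transformers` implementation, and the names `warp_distribution` and `sample_token` are made up for this example), the three settings can be thought of as successive filters on the model's logits:

```python
import math
import random


def warp_distribution(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Map raw logits to a filtered, renormalised {token_id: probability} dict."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}


def sample_token(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = warp_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With a toy vocabulary of four tokens, `warp_distribution([2.0, 1.0, 0.0, -1.0], top_k=2)` keeps only the two most likely token ids, and `sample_token` then draws from what remains. `no_repeat_ngram_size` is not covered by this sketch.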
53
+
54
+ ## Tokenizer
55
+
56
+ * BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface
57
+ Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
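
For readers unfamiliar with BPE, the core idea of training such a tokenizer is to repeatedly merge the most frequent adjacent symbol pair. The following is a toy, character-level sketch of that merge-learning loop, assuming a tiny word list as the corpus. It is not the script used for this model (GPT-2 style tokenizers operate on bytes and are trained with the Hugging Face tooling linked above), and `learn_bpe` is a hypothetical helper name.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each word is a tuple of symbols, with an end-of-word marker appended.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["het", "het", "heel"], num_merges=3)
```

In this toy corpus the pair `h` + `e` occurs in all three words, so it is the first merge learned. A real tokenizer runs tens of thousands of such merges over the full corpus.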
58
+
59
+ ## Dataset
60
+
61
+ This model was trained on the `full` configuration (33B tokens) of
62
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
63
+ which is the original mC4 with the following filters applied:
64
+
65
+ * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
66
+ * Sentences with fewer than 3 words are removed
67
+ * Sentences with a word of more than 1000 characters are removed
68
+ * Documents with fewer than 5 sentences are removed
69
+ * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
70
+ "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
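
The filters listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cleaning script actually used for the dataset: the sentence splitter is a naive regex, matching is case-insensitive, the sentence-count check runs after short sentences are dropped, and `bad_words` is a hypothetical stand-in for the selected word list.

```python
import re

# Phrases taken from the list above; a document containing any of them is dropped.
BANNED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)


def clean_document(text, bad_words=frozenset()):
    """Return the cleaned document text, or None if the document is rejected."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BANNED_PHRASES):
        return None
    # Assumption: bad words are matched as whole, lowercased words.
    if set(re.findall(r"\w+", lowered)) & set(bad_words):
        return None

    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kept = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) < 3:
            continue  # sentence too short
        if any(len(word) > 1000 for word in words):
            continue  # contains an implausibly long "word"
        kept.append(sentence)

    # Assumption: the 5-sentence minimum is checked after sentence filtering.
    if len(kept) < 5:
        return None
    return " ".join(kept)
```

For example, `clean_document("Dit is een zin. " * 5)` returns the text unchanged apart from whitespace, while a document with only four usable sentences, or one mentioning "privacy policy", yields `None`.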
71
+
72
+ ## Models
73
+
74
+ TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.
75
+
76
+ * Models with `a`/`b` in the `steps` column have been trained to step `a` of a total of `b` steps.
77
+
78
+ | | model | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
79
+ |-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
80
+ | [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M | 512 | 20.9 | 3.04 | 128 | 1 | 190000/558608 | adam | 2.4e-3 | 1d 12h | full |
81
+ | [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 1 | 320000/520502 | adam | 8e-4 | 7d 2h | full |
82
+ | [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch) | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
83
+ | [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |
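
The `ppl` and `loss` columns are related: assuming the reported loss is the mean per-token cross-entropy in nats (the table does not state this explicitly), perplexity is simply `exp(loss)`. The short check below uses only the values from the table and shows that they agree up to rounding of the loss.

```python
import math

# (loss, ppl) pairs copied from the table above.
reported = {
    "gpt-neo-125M-dutch": (3.04, 20.9),
    "gpt2-medium-dutch": (2.71, 15.1),
    "gpt2-large-dutch": (2.72, 15.1),
    "gpt-neo-1.3B-dutch": (2.77, 16.0),
}


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


for name, (loss, ppl) in reported.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f} (table: {ppl})")
```

Small differences between `exp(loss)` and the tabulated perplexity are expected, because the loss is shown with only two decimals.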
84
+
85
+
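
The `steps`, `batch size` and `train seq len` columns also give a rough idea of how much data each model has seen. The sketch below assumes that the batch size counts sequences per optimizer step and that every sequence is filled to the full training length; neither is confirmed by the table, so treat the token count as an estimate. `training_progress` is a hypothetical helper name.

```python
def training_progress(steps_done, steps_total, batch_size, seq_len):
    """Return (fraction of planned steps completed, approximate tokens seen)."""
    fraction = steps_done / steps_total
    tokens_seen = steps_done * batch_size * seq_len
    return fraction, tokens_seen


# Values for yhavinga/gpt-neo-125M-dutch, taken from the table above.
fraction, tokens = training_progress(190_000, 558_608, batch_size=128, seq_len=512)
print(f"{fraction:.1%} of the planned steps, about {tokens / 1e9:.1f}B tokens")
```

Under these assumptions the 125M model was trained on roughly a third of the planned single epoch.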
86
+ ## Acknowledgements
87
+
88
+ This project would not have been possible without compute generously provided by Google through the
89
+ [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
90
+ instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
91
+ and training the models:
92
+
93
+ * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
94
+ * [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
95
+ * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
96
+ * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
97
+
98
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)
99
+