terrycraddock committed on
Commit c8bc9cc
1 Parent(s): e14226e

Update README.md

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -1,3 +1,15 @@
 ---
 license: mit
 ---
+
+ Custom tokenizer used to tokenize my dataset terrycraddock/GPT2-PretrainV1-en. This tokenizer is essentially the default GPT-2 tokenizer with a [PAD] token added so
+ the trainer can perform knowledge-distillation training, and it was retrained on my dataset to be more efficient.
+
+ Link to tokenized dataset:
+ https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-Tokenized-en
+
+ Link to non-tokenized dataset:
+ https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-en
+
+ This custom tokenizer and the datasets above are intended to serve as a pretraining corpus for knowledge distillation from a larger GPT-2 model to a
+ smaller custom one.
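
A minimal usage sketch for the tokenizer and datasets described above, loading them with `transformers` and `datasets`. Only the two dataset ids come from the README; the tokenizer repo id, the `text` column name, and the 1024-token context length are assumptions.

```python
# Minimal sketch, not the author's exact pipeline. The tokenizer repo id and the
# "text" column name are assumptions; only the dataset ids come from the README above.
from transformers import AutoTokenizer
from datasets import load_dataset

# Hypothetical repo id for the custom GPT-2 tokenizer described above.
tokenizer = AutoTokenizer.from_pretrained("terrycraddock/GPT2-PretrainV1-Tokenizer")

# The added [PAD] token should be picked up when the tokenizer is loaded.
print(tokenizer.pad_token)  # expected: [PAD]

# Raw (non-tokenized) pretraining corpus linked above.
raw_ds = load_dataset("terrycraddock/GPT2-PretrainV1-en", split="train")

def tokenize_batch(batch):
    # Pad/truncate to GPT-2's 1024-token context window (length is an assumption).
    return tokenizer(batch["text"], truncation=True, max_length=1024, padding="max_length")

tokenized_ds = raw_ds.map(tokenize_batch, batched=True, remove_columns=raw_ds.column_names)
```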
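
A rough sketch of the intended knowledge-distillation setup, with a larger GPT-2 teacher guiding a smaller custom student on this corpus. The teacher checkpoint, student dimensions, temperature, and loss weighting are illustrative assumptions, not values taken from the README.

```python
# Knowledge-distillation sketch: a large GPT-2 teacher guides a smaller student.
# Model sizes, temperature, and alpha are assumptions chosen for illustration.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Config

teacher = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()          # assumed teacher
student = GPT2LMHeadModel(GPT2Config(n_layer=6, n_head=8, n_embd=512))  # assumed student size

# Both models must share the custom tokenizer's vocabulary, e.g. via
# model.resize_token_embeddings(len(tokenizer)) after the [PAD] token is added.

def distillation_loss(input_ids, attention_mask, temperature=2.0, alpha=0.5):
    """Blend a soft-label KL term against the teacher with the standard LM loss."""
    with torch.no_grad():
        t_logits = teacher(input_ids, attention_mask=attention_mask).logits
    # labels=input_ids gives the usual causal LM loss; pad positions would normally
    # be masked with -100 in the labels, omitted here for brevity.
    out = student(input_ids, attention_mask=attention_mask, labels=input_ids)
    s_logits = out.logits
    kd = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1).reshape(-1, s_logits.size(-1)),
        F.softmax(t_logits / temperature, dim=-1).reshape(-1, t_logits.size(-1)),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * kd + (1.0 - alpha) * out.loss
```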