---
license: mit
---

Custom tokenizer used to tokenize my dataset terrycraddock/GPT2-PretrainV1-en. It is essentially the default GPT-2 tokenizer with a [PAD] token added so the trainer can perform knowledge-distillation training, and it was retrained on my dataset to be more efficient.
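A minimal sketch of loading the tokenizer and using the added [PAD] token (the repo id below is an assumption; substitute this repository's actual path):

```python
from transformers import AutoTokenizer

# Repo id is assumed for illustration; replace with this repository's path.
tokenizer = AutoTokenizer.from_pretrained("terrycraddock/GPT2-PretrainV1-Tokenizer-en")

# Unlike stock GPT-2, this tokenizer ships a [PAD] token,
# so padded batches work without further setup.
print(tokenizer.pad_token)  # [PAD]

batch = tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```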

Link to tokenized dataset: 
https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-Tokenized-en

Link to non-tokenized dataset:
https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-en

This custom tokenizer and the datasets above are intended to serve as a pretraining corpus for knowledge distillation from a larger GPT2 model to a smaller custom one.
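
A sketch of preparing padded, fixed-length batches from the raw dataset for a distillation run (repo id and the `text` column name are assumptions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed repo id; replace with this repository's actual path.
tokenizer = AutoTokenizer.from_pretrained("terrycraddock/GPT2-PretrainV1-Tokenizer-en")
dataset = load_dataset("terrycraddock/GPT2-PretrainV1-en", split="train")

def tokenize(batch):
    # Pad/truncate to a fixed length so the teacher and student
    # models see aligned input batches during distillation.
    return tokenizer(
        batch["text"],  # column name assumed
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized = dataset.map(tokenize, batched=True)
```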