---
license: mit
---
Custom tokenizer used to tokenize my dataset terrycraddock/GPT2-PretrainV1-en. This tokenizer is essentially the default GPT-2 tokenizer, except that I added a [PAD] token so the trainer can perform knowledge-distillation training, and I trained it on my dataset to make it more efficient.
Link to tokenized dataset:
https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-Tokenized-en
Link to non-tokenized dataset:
https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-en
This custom tokenizer and the datasets mentioned above are intended to be used as a pretraining corpus for knowledge distillation from a larger GPT-2 model to a smaller custom one.