
This is a fork of https://huggingface.co/NeelNanda/gpt-neox-tokenizer-digits with a fix that makes it work on tokenizers == 0.14, after a breaking change involving added_tokens: https://github.com/huggingface/tokenizers/issues/1358

The two changes from NeelNanda/gpt-neox-tokenizer-digits are:

  1. (Important) we remove the space tokens from the "added_tokens" key in tokenizer.json here: https://huggingface.co/ArthurConmy/alternative-neel-tokenizer/blob/main/tokenizer.json . Together with the tokenizers change discussed in the issue linked above, these tokens caused the breakage.
  2. (Not important) we use GPTNeoXTokenizer rather than PreTrainedTokenizerFast in tokenizer_config.json, as this seemed to match what GPT-NeoX did.
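As a rough illustration of change 1, the sketch below filters whitespace-only entries out of the "added_tokens" list of a tokenizer.json-style dict. It is not the exact procedure used for this repo: the helper name `strip_whitespace_added_tokens`, the toy entries, and their ids are all hypothetical, and it assumes that "space tokens" means entries whose "content" is made up only of whitespace.

```python
import json


def strip_whitespace_added_tokens(tokenizer_json: dict) -> dict:
    """Return a shallow copy of a tokenizer.json dict without whitespace-only added tokens.

    Hypothetical helper: assumes each entry in "added_tokens" is a dict
    with a "content" string, as in the tokenizer.json format.
    """
    cleaned = dict(tokenizer_json)
    cleaned["added_tokens"] = [
        entry
        for entry in tokenizer_json.get("added_tokens", [])
        if entry["content"].strip() != ""  # keep only entries with visible content
    ]
    return cleaned


# Toy stand-in for the real file; ids and entries are made up for illustration.
toy = {
    "added_tokens": [
        {"id": 0, "content": "<|PAD|>", "special": True},
        {"id": 1, "content": "  ", "special": False},
        {"id": 2, "content": "    ", "special": False},
    ]
}

cleaned = strip_whitespace_added_tokens(toy)
print(json.dumps(cleaned["added_tokens"], indent=2))
```

For a real file you would load it with `json.load`, apply the filter, and write it back with `json.dump`; whether the removed strings still exist in the main vocabulary should be checked separately.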

Neel's README

This is a fork of the GPT NeoX 20B tokenizer, edited to split every numerical digit into a separate token. The goal is to make it easier for the model to learn arithmetic capabilities and, hopefully, to be more interpretable. The idea is copied from the PaLM tokenizer.

This was done, extremely hackily, by just removing every token that matched the regex "\d\d", i.e. contained two consecutive digits (e.g. "2013"). All remaining digit-containing tokens are "0" ... "9" and " 0" ... " 9".
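The filtering rule can be sketched on a toy vocabulary as follows. This is only an illustration of the "\d\d" criterion, not the original script: the function name `drop_multi_digit_tokens` and the toy vocabulary are hypothetical, and a real BPE tokenizer would presumably also need its merge rules updated to match the reduced vocabulary.

```python
import re

# Matches any token containing two consecutive digits, e.g. "2013" or "x86".
MULTI_DIGIT = re.compile(r"\d\d")


def drop_multi_digit_tokens(vocab):
    """Keep only tokens that do not contain two digits in a row (hypothetical helper)."""
    return [tok for tok in vocab if not MULTI_DIGIT.search(tok)]


# Toy vocabulary, made up for illustration.
toy_vocab = ["the", "0", "9", " 0", " 9", "2013", " 19", "x86", "v10"]
kept = drop_multi_digit_tokens(toy_vocab)
print(kept)
```

In this toy case only the single-digit tokens (with and without a leading space) and the digit-free token survive, so a string like "2013" can no longer be encoded as one token.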

This comes at the cost of making modelling normal text harder, since e.g. dates like 2013, which naturally should be a single token, are now 2|0|1|3.

This has a reduced vocab size of 48252. Several of the tokens towards the end are special whitespace tokens copied in from GPT-NeoX to make tokenizing code easier; some of these are duplicated in the vocabulary and thus may not actually show up at train time.

It includes a padding token (<|PAD|>), an End-Of-String token (<|EOS|>), and a Beginning-Of-String token (<|BOS|>).