
Question about tokenizer

#3
by freQuensy23 - opened

I've tried to use your model and can't understand some behaviour of its tokenizer:
tokenizer('1') returns [1, 29871, 29896].
1 is the BOS token and 29896 is the '1' token, but what does 29871 mean?

When I decode it back into a string, I get tokenizer.decode([29871]) == '' (an empty string).
Can you explain the purpose of adding an empty-string token to the tokenizer's vocabulary?
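For context, a likely explanation is that id 29871 is not an "empty string" token but the SentencePiece metaspace piece "▁" (U+2581), which LLaMA-style tokenizers prepend to mark a word-leading space. The sketch below is a simplified, hypothetical model of SentencePiece-style detokenization (it does not load the real tokenizer, and the piece-to-id mapping is an assumption) showing why decoding the marker alone yields an empty string:

```python
# Simplified sketch of SentencePiece-style detokenization.
# Assumption: id 29871 maps to the metaspace piece "\u2581" ("▁"),
# which encodes the implicit space before a word; id 29896 maps to "1".

def decode(pieces):
    """Join pieces, turn the metaspace marker into a real space,
    and strip the artificial leading space added during encoding."""
    text = "".join(pieces).replace("\u2581", " ")
    return text[1:] if text.startswith(" ") else text

# "1" is encoded as ["\u2581", "1"]: the marker carries the implicit
# leading space, and "1" is the digit piece.
print(decode(["\u2581", "1"]))  # -> "1"

# The marker piece alone becomes a single leading space, which the
# decoder strips -- hence tokenizer.decode([29871]) == "".
print(decode(["\u2581"]))  # -> ""
```

Under this assumption, 29871 is a real (non-empty) piece in the vocabulary; it only looks empty after the decoder strips the leading space it represents.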
