Fix slow tokenizer
#14
by pcuenq
Before:
from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it")
tokenizer.decode([139])
'▁▁'
With this PR:
tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it", revision="refs/pr/14")
tokenizer.decode([139])
' '
This was introduced when the user-defined tokens were added, in a PR similar to this one: https://huggingface.co/google/gemma-1.1-2b-it/discussions/1.
I don't know if we should keep them in the config using the character " " instead of "▁". I tested the slow and fast tokenizers for encoding and decoding, but I'm not sure if I'm missing something.
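Roughly, the slow/fast comparison I have in mind looks like this (the sample text and the use of AutoTokenizer are just for illustration, not part of this PR):

from transformers import AutoTokenizer

# Sketch of the slow/fast comparison; "refs/pr/14" points at this PR.
slow = AutoTokenizer.from_pretrained(
    "google/gemma-1.1-7b-it", revision="refs/pr/14", use_fast=False
)
fast = AutoTokenizer.from_pretrained(
    "google/gemma-1.1-7b-it", revision="refs/pr/14", use_fast=True
)

text = "Hello   world"  # runs of spaces hit the user-defined whitespace tokens
slow_ids = slow.encode(text)
fast_ids = fast.encode(text)
print(slow_ids == fast_ids)                            # encoding parity
print(slow.decode(slow_ids) == fast.decode(fast_ids))  # decoding parity
print(repr(slow.decode(slow_ids, skip_special_tokens=True)))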
>>> tokenizer.convert_ids_to_tokens([139])
['▁▁']
which means the token is correct, but the decode function is not. tokenizers does not complain, but transformers never post-processes the replacements. These should be the only affected tokens (because they are the ones based on the replacement).
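To double-check which tokens are affected, something like this should list the added tokens that contain the replacement character (just a sketch using the standard get_added_vocab helper):

from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it")

# Sketch: the added (user-defined) tokens that contain the "▁" replacement
# character are the ones whose decode output is affected.
affected = {
    token_id: token
    for token, token_id in tokenizer.get_added_vocab().items()
    if "▁" in token
}
print(affected)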
Could also have a fix in transformers
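Very roughly, such a fix would amount to post-processing the replacement when the slow tokenizer turns tokens back into text; this is only an illustration of the idea, not the actual transformers code:

# Illustration only (hypothetical helper, not the real implementation): map the
# SentencePiece "▁" replacement back to plain spaces when joining tokens.
def tokens_to_text(tokens):
    text = "".join(tokens)
    return text.replace("▁", " ")

print(repr(tokens_to_text(["▁▁"])))  # '  ' rather than '▁▁'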
pcuenq changed pull request status to merged