Fix slow tokenizer
#14
by pcuenq
Before:
from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it")
tokenizer.decode([139])
'▁▁'
With this PR:
tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it", revision="refs/pr/14")
tokenizer.decode([139])
' '
This was introduced when the user-defined tokens were added, in a PR similar to this one: https://huggingface.co/google/gemma-1.1-2b-it/discussions/1.
I don't know if we should keep them in the config using the character " " instead of "▁". I tested the slow and fast tokenizers for encoding and decoding, but I'm not sure if I'm missing something.
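Roughly, the slow/fast comparison I have in mind looks like this (the sample text and the use of AutoTokenizer are just for illustration, not part of this PR):

from transformers import AutoTokenizer

# Sketch of the slow/fast comparison; "refs/pr/14" points at this PR.
slow = AutoTokenizer.from_pretrained(
    "google/gemma-1.1-7b-it", revision="refs/pr/14", use_fast=False
)
fast = AutoTokenizer.from_pretrained(
    "google/gemma-1.1-7b-it", revision="refs/pr/14", use_fast=True
)

text = "Hello   world"  # runs of spaces hit the user-defined whitespace tokens
slow_ids = slow.encode(text)
fast_ids = fast.encode(text)
print(slow_ids == fast_ids)                            # encoding parity
print(slow.decode(slow_ids) == fast.decode(fast_ids))  # decoding parity
print(repr(slow.decode(slow_ids, skip_special_tokens=True)))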
>>> tokenizer.convert_ids_to_tokens([139])
['▁▁']
which means the token is correct, but the decode function is not. tokenizers does not complain, but transformers never post-processes the replacements. These should be the only affected tokens (because they are the ones based on the replacement).
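To double-check which tokens are affected, something like this should list the added tokens that contain the replacement character (just a sketch using the standard get_added_vocab helper):

from transformers import GemmaTokenizer

tokenizer = GemmaTokenizer.from_pretrained("google/gemma-1.1-7b-it")

# Sketch: the added (user-defined) tokens that contain the "▁" replacement
# character are the ones whose decode output is affected.
affected = {
    token_id: token
    for token, token_id in tokenizer.get_added_vocab().items()
    if "▁" in token
}
print(affected)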
Could also have a fix in transformers
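Very roughly, such a fix would amount to post-processing the replacement when the slow tokenizer turns tokens back into text; this is only an illustration of the idea, not the actual transformers code:

# Illustration only (hypothetical helper, not the real implementation): map the
# SentencePiece "▁" replacement back to plain spaces when joining tokens.
def tokens_to_text(tokens):
    text = "".join(tokens)
    return text.replace("▁", " ")

print(repr(tokens_to_text(["▁▁"])))  # '  ' rather than '▁▁'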
pcuenq changed pull request status to merged