What is the id 9 in japanese-gpt2-medium tokenizer?
#1, opened by gojiteji
I'm trying to use japanese-gpt2-medium for my research.
I found that the tokenizer sometimes outputs id 9 at the start of the sequence, as shown below.
print(tokenizer("hello world").input_ids)
> [9, 22848, 463, 7375, 2]
print(tokenizer("dog").input_ids)
> [6832, 275, 2]
But id 9 appears to decode to an empty string.
print(tokenizer.decode([9]), len(tokenizer.decode([9])))
> 0
What does the token with id 9 mean? When fine-tuning, should I leave id 9 in the input?
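It is the meta symbol "▁" (U+2581), which sentencepiece uses to represent whitespace. When the segmentation does not merge it with the piece that follows, it is emitted as a token of its own, as in the first example below. Decoded on its own it is just a leading space, which gets stripped, so you see an empty string.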
Please refer to the sentencepiece repo for details: https://github.com/google/sentencepiece
>>> tokenizer.tokenize("hello world")
['▁', 'hell', 'o', '▁world']
>>> tokenizer.tokenize("dog")
['▁do', 'g']
You can leave it as it is for fine-tuning.
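To make the role of "▁" concrete, here is a minimal pure-Python sketch of the whitespace convention. It is a simplified illustration, not the actual sentencepiece implementation: the helper names `to_meta` and `from_pieces` are made up for this example, and it assumes the default behaviour of adding a dummy space in front of the text.

```python
from typing import List

META = "\u2581"  # "▁", the whitespace meta symbol used by sentencepiece


def to_meta(text: str) -> str:
    """Hypothetical helper: prepend a dummy space, then mark every space with META."""
    return (" " + text).replace(" ", META)


def from_pieces(pieces: List[str]) -> str:
    """Hypothetical helper: join pieces, turn META back into spaces, drop one leading space."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


# The pieces reported above for "hello world" round-trip to the original text.
print(from_pieces([META, "hell", "o", META + "world"]))
# A lone "▁" is only a leading space, so it decodes to an empty string.
print(len(from_pieces([META])))
```

Under these assumptions, the standalone "▁" carries the word-boundary information for "hello", so removing it would change what the model sees compared with pretraining.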
Thank you for your reply.
I see.
I’ll do so.
gojiteji changed discussion status to closed