What is id 9 in the japanese-gpt2-medium tokenizer?

by gojiteji - opened Feb 7, 2023

Feb 7, 2023

I'm trying to use japanese-gpt2-medium for my research.
I found that the tokenizer sometimes outputs id 9 at the start of the sequence, as shown below.

print(tokenizer("hello world").input_ids)
> [9, 22848, 463, 7375, 2]
print(tokenizer("dog").input_ids)
> [6832, 275, 2]

But id 9 seems to decode to an empty string.

print(tokenizer.decode([9]), len(tokenizer.decode([9])))
> 0

What does the token with id 9 mean? When fine-tuning, should I leave id 9 in the input?

tianyuz

Feb 10, 2023

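It is the special meta symbol "▁" (U+2581) that sentencepiece uses to mark whitespace. When the segmentation does not merge it with the characters that follow, it is emitted as a token of its own.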
Please refer to the sentencepiece repo for details: https://github.com/google/sentencepiece

>>> tokenizer.tokenize("hello world")
['▁', 'hell', 'o', '▁world']
>>> tokenizer.tokenize("dog")
['▁do', 'g']

You can leave it as is for fine-tuning.
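
To make the behaviour above concrete, here is a minimal pure-Python sketch of how the meta symbol is handled. It is not the real sentencepiece algorithm: `TOY_VOCAB`, `toy_tokenize`, and `toy_decode` are hypothetical names, the vocabulary is a toy one chosen to mirror the pieces shown above, and greedy longest-match stands in for the actual unigram segmentation.

```python
META = "\u2581"  # the sentencepiece whitespace meta symbol "▁"

# Hypothetical toy vocabulary mirroring the pieces shown in this thread.
TOY_VOCAB = {META, "hell", "o", META + "world", META + "do", "g"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; a stand-in for the real unigram model."""
    # Prepend a dummy prefix and replace every space with the meta symbol.
    normalized = META + text.replace(" ", META)
    pieces, i = [], 0
    while i < len(normalized):
        # Try the longest candidate first, then shrink it.
        for j in range(len(normalized), i, -1):
            if normalized[i:j] in vocab:
                pieces.append(normalized[i:j])
                i = j
                break
        else:
            raise ValueError(f"no piece covers {normalized[i]!r}")
    return pieces


def toy_decode(pieces):
    """Join pieces, map meta symbols back to spaces, drop the leading space."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


print(toy_tokenize("hello world"))    # a standalone "▁" comes first
print(toy_tokenize("dog"))            # "▁" is merged into the first piece
print(repr(toy_decode([META])))       # the lone meta symbol decodes to ''
```

Because the toy vocabulary has no piece that joins "▁" with "h", the meta symbol is left on its own for "hello world", while "dog" starts with the merged piece "▁do". On the real tokenizer, `tokenizer.convert_ids_to_tokens([9])` should show the same symbol, going by the reply above.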

gojiteji

Feb 10, 2023

Thank you for your reply.
I see.
I’ll do so.

gojiteji changed discussion status to closed Feb 10, 2023
