Size of tokenizer vocab and size of embedding layer / model vocab are different

#29
by pratikkumar2008 - opened

The tokenizer vocab size is 50257, while the vocab size in the config is 51200. Is there any particular reason for this? Also, how should we deal with adding extra tokens to the tokenizer (resizing the model's embedding layer)?
https://huggingface.co/microsoft/phi-1_5/blob/main/config.json

import os
import torch
from transformers import AutoTokenizer

pretrained_model_name_or_path = os.path.expanduser("~/phi-1_5")  # local clone of microsoft/phi-1_5
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
print(tokenizer.vocab_size)
# 50257

# torch.load expects the weights file (usually pytorch_model.bin), not the model directory
m_my = torch.load(os.path.join(pretrained_model_name_or_path, "pytorch_model.bin"))
for k1, v1 in m_my.items():
    print(k1, v1.shape)
# layers.0.wte.weight torch.Size([51200, 2048])


Microsoft org

Hello @pratikkumar2008!

These extra rows won't matter, since they are never used when indexing the Embedding layer. The mismatch is intentional: 51200 is a multiple of 64, which takes advantage of the faster cores on Ampere-based GPUs.

In theory, it should have been 50304, which is the closest multiple of 64 above 50257.
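
As for the second part of the question (adding extra tokens), a minimal sketch of the usual pattern, assuming resize_token_embeddings works here as it does for other transformers models; the pad_to_multiple helper and the placeholder token are just for illustration, and trust_remote_code may or may not be required depending on your transformers version:

import math
from transformers import AutoModelForCausalLM, AutoTokenizer

def pad_to_multiple(n, multiple=64):
    # Round n up to the next multiple of `multiple` (e.g. 50257 -> 50304)
    return math.ceil(n / multiple) * multiple

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

# Register the new tokens, then grow the embedding matrix so their ids are valid
tokenizer.add_tokens(["<my_new_token>"])
model.resize_token_embeddings(pad_to_multiple(len(tokenizer)))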

Hi there,

I understand that it works fine as long as tokenizer.vocab_size <= model.layers[0].wte.weight.shape[0], but it seems that the number 50257 is actually incorrect.
When you count unique indices in the vocabulary, including added_tokens, the correct number appears to be 50295 instead.
I am not sure how this attribute is set when the tokenizer is initialized, but this may need to be fixed, since users sometimes rely on this attribute (tokenizer.vocab_size) to get the vocabulary size.
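
For reference, a quick way to see the discrepancy being described (the 50295 figure above is the count that includes the added_tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
print(tokenizer.vocab_size)        # base vocabulary only (50257)
print(len(tokenizer.get_vocab()))  # base vocabulary plus added_tokens
print(len(tokenizer))              # also counts added tokens; safer to rely on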

Microsoft org
•
edited Oct 3, 2023

This is the expected behavior of transformers. Please check this issue: https://github.com/huggingface/transformers/issues/12632

I'm afraid the link you suggested doesn't seem very relevant to this issue.

Of course, we can get the actual vocabulary size with len(tokenizer.get_vocab()) or something similar.
However, the added_tokens are incorporated by default, without users specifying them, because they are defined in tokenizer.json.
Given that the argument is supposed to be passed by users, I would not consider this an "expected behavior" of the library.
The current implementation can cause errors for future users relying on the (presumably widely used) vocab_size attribute, so it would be better to correct it, perhaps by moving the additional tokens into the default vocabulary.

Thanks for your response.

Microsoft org
•
edited Oct 4, 2023

Hi @kyo-takano.

I definitely agree with you, but this is out of our hands, since transformers is the one handling the tokenizer-related files, i.e., they decided not to count the added_tokens in vocab_size. We are just following the same pattern as other models that use the CodeGen tokenizer.

We could incorporate the added_tokens.json tokens into the vocab key of the tokenizer.json file, but this would just be a "hack", since new added_tokens introduced by users would still not be reflected in the vocab_size property.
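
For what it's worth, a rough sketch of what such a hack could look like, assuming the standard Hugging Face file layout (a tokenizer.json with a model.vocab mapping and an added_tokens.json mapping token strings to ids); this is only an illustration of the idea, not a recommended fix:

import json

# Fold the tokens from added_tokens.json into the base vocab of tokenizer.json
with open("added_tokens.json") as f:
    added_tokens = json.load(f)  # maps token string -> id
with open("tokenizer.json") as f:
    tokenizer_json = json.load(f)

tokenizer_json["model"]["vocab"].update(added_tokens)

with open("tokenizer.json", "w") as f:
    json.dump(tokenizer_json, f, ensure_ascii=False, indent=2)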

gugarosa changed discussion status to closed
