Vocab size does not match tokenizer config
Great to see more Danish language models, well done!
I'm having trouble running the model, which is most likely due to the following clash in your configuration:
- In your `config.json`, you write that the vocabulary size (`vocab_size`) is 32,000.
- In your `tokenizer_config.json`, in the `added_tokens_decoder` mapping, you define several tokens with indices of 32,000 and above.

So it seems like you either need to increase the vocabulary size in the config, remove the extra tokens from the tokenizer config, or change the indices of the extra tokens defined in the tokenizer config (if the current indices are wrong).
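For reference, here is a minimal sketch of how to see the clash locally (using the base repo as an example; it just compares the config's `vocab_size` with the largest token id the tokenizer defines):

```python
# Sketch: compare the model config's vocab_size with the largest id the tokenizer uses.
from transformers import AutoConfig, AutoTokenizer

repo = "NLPnorth/snakmodel-7b-base"
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

max_id = max(tokenizer.get_vocab().values())          # get_vocab() includes the added tokens
print(f"config.vocab_size    = {config.vocab_size}")  # 32,000 per config.json
print(f"largest tokenizer id = {max_id}")             # 32,004 because of the extra tokens
assert max_id < config.vocab_size, "tokenizer uses ids outside the model's vocabulary"
```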
Some concrete comments and suggestions:
- The CLS, SEP and MASK tokens among the extra tokens are not used in practice.
- What is the EOD token?
- The padding side in the tokenizer config should be 'left' for decoder models.
- The `model_max_length` is currently the default value, which is essentially infinity. Since the model is based on Llama-2, this should be changed to 4,096, I suppose? (A sketch applying this and the padding-side change follows this list.)
- A solution to the vocab/extra-tokens issue could maybe be to raise the vocab size in the config to 32,005?
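To apply the padding-side and `model_max_length` suggestions, something along these lines should work (a minimal sketch; the resulting files would still need to be uploaded to the repo):

```python
# Sketch: set the suggested tokenizer settings and write them back out.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-base")
tokenizer.model_max_length = 4096   # Llama-2 context window
tokenizer.padding_side = "left"     # left padding for a decoder-only model
tokenizer.save_pretrained("snakmodel-7b-base-fixed")  # writes an updated tokenizer_config.json
```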
Hi Dan! Thanks for the interest and also the great pointers.
We further pre-trained this model using the Megatron-LLM library, and they ran into similar problems with the tokenizer: https://huggingface.co/epfl-llm/meditron-7b/discussions/5
I will discuss this with my colleagues, and we will get back to you asap.
As far as I'm concerned, we can raise the vocab size in the config to 32,005. The padding side is also "right" in the original Llama-2 tokenizer config, as is the infinite `model_max_length`: https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/tokenizer_config.json, but it makes sense to change it to 4,096.
Have you seen left padding work better?
Will keep you updated.
@jjzha Thanks for your quick reply!
Yeah, it seems exactly like the issue you linked to. I checked the size of the model's embedding layer, which is 32,000, so the tokenizer really shouldn't use any tokens with index >= 32,000. In my use case the errors happen with the EOD and PAD tokens, but I suppose these could just be set to the EOS token (aka `</s>`). Doing this, as well as removing the other unneeded extra tokens, should hopefully fix things. The tokenizer's base vocab without these is 32,000, so that seems fine.
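For the PAD token, that amounts to something like this (a sketch; it only changes the padding token on the `transformers` side, and the EOD token would simply be dropped rather than remapped):

```python
# Sketch: reuse </s> as the padding token instead of a <PAD> token that falls
# outside the model's 32,000-row embedding matrix.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-base")
tokenizer.pad_token = tokenizer.eos_token   # "</s>", id 2
print(tokenizer.pad_token, tokenizer.pad_token_id)
```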
As for the padding, Meta doesn't use padding at all when they're pretraining models, so the models haven't been trained with any padding token (which is why it's added on top). Left padding is usually used for generative models, since otherwise you end up with samples in your batch like "I really like <pad>", and if the model then generates "ice cream" the resulting document will be "I really like <pad> ice cream", which just seems quite unintuitive. For that reason, left padding tends to be the standard for generative models. But I suppose this mainly matters if you're finetuning the model.
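To make the left-vs-right point concrete, here is a small illustration (a sketch, assuming the pad token has been mapped to `</s>` as above):

```python
# Illustration: with right padding the pads end up between the prompt and any
# generated text; with left padding the prompt sits at the end of the sequence,
# so generation continues directly from it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-base")
tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"
print(tokenizer("I really like", padding="max_length", max_length=8)["input_ids"])
# prompt ids first, then pad ids - generated tokens would come after the pads

tokenizer.padding_side = "left"
print(tokenizer("I really like", padding="max_length", max_length=8)["input_ids"])
# pad ids first, prompt ids last - generated tokens follow the prompt directly
```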
Hi @saattrupdan,
We have now edited the following in both the base and instruct versions:
- Removed the unnecessary special tokens (e.g., CLS, SEP, etc.) in `special_tokens_map.json` and `tokenizer_config.json`;
- Set `model_max_length` to 4096;
- Set the padding side to `left`;
- In the instruct version, added the chat template (i.e., `[INST] [/INST]`); see the sketch below.
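The template can be checked like this (a quick sketch; the instruct repo id is assumed here, and the exact output depends on the template that was added):

```python
# Sketch: render the chat template without tokenizing to inspect the [INST] format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-instruct")  # assumed repo id
messages = [{"role": "user", "content": "Hvad hedder hovedstaden i Danmark?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False))
# Expected to look roughly like: "<s>[INST] Hvad hedder hovedstaden i Danmark? [/INST]"
```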
Let us know if this all works.
Cheers!
Hi @jjzha. You still have the `added_tokens.json` file, which defines the tokens with index >= 32,000; that file needs to be removed.
A good sanity check that you can do locally is just to load the tokeniser and print it - here's how it looks for me currently:
```python
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained('NLPnorth/snakmodel-7b-base')
LlamaTokenizerFast(name_or_path='NLPnorth/snakmodel-7b-base', vocab_size=32000, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
    0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32000: AddedToken("<CLS>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32001: AddedToken("<SEP>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32002: AddedToken("<EOD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32003: AddedToken("<MASK>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32004: AddedToken("<PAD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
```
We see that it still registers the extra tokens, probably due to the existence of the above-mentioned `added_tokens.json` file.
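Once that file is gone, a final check could be that the tokenizer's length no longer exceeds the embedding matrix (a sketch; loading the full model just for this is heavy, but it makes the mismatch explicit):

```python
# Sketch: the number of ids the tokenizer can emit should not exceed the number
# of rows in the model's input embedding matrix.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "NLPnorth/snakmodel-7b-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

n_embeddings = model.get_input_embeddings().weight.shape[0]
print(len(tokenizer), n_embeddings)
assert len(tokenizer) <= n_embeddings, "tokenizer still defines out-of-range ids"
```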
Hi Dan,
You're correct: we also needed to edit the `tokenizer.json` file. I have now done this; let me know if it helps!