Vocab size does not match tokenizer config
Great to see more Danish language models, well done!
I'm having trouble running the model, which is most likely due to the following clash in your configuration:
- In your `config.json`, you write that the vocabulary size (`vocab_size`) is 32,000.
- In your `tokenizer_config.json`, in the `added_tokens_decoder` mapping, you define several tokens with indices of 32,000 and above.

So it seems like you either need to increase the vocabulary size in the config, remove the extra tokens from the tokenizer config, or change the indices of the extra tokens defined in the tokenizer config (if the current indices are wrong).
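For reference, here is a minimal sketch of how to see the clash locally (using the base repo as an example; it just compares the config's `vocab_size` with the largest token id the tokenizer defines):

```python
# Sketch: compare the model config's vocab_size with the largest id the tokenizer uses.
from transformers import AutoConfig, AutoTokenizer

repo = "NLPnorth/snakmodel-7b-base"
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

max_id = max(tokenizer.get_vocab().values())          # get_vocab() includes the added tokens
print(f"config.vocab_size    = {config.vocab_size}")  # 32,000 per config.json
print(f"largest tokenizer id = {max_id}")             # 32,004 because of the extra tokens
assert max_id < config.vocab_size, "tokenizer uses ids outside the model's vocabulary"
```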
Some concrete comments and suggestions:
- The CLS, SEP and MASK tokens among the extra tokens are not used in practice.
- What is the EOD token?
- The padding side in the tokenizer config should be 'left' for decoder models.
- The `model_max_length` is currently the default value, which is essentially infinity. Since the model is based on Llama-2, this should be changed to 4,096, I suppose? (A sketch applying this and the padding-side change follows this list.)
- A solution to the vocab/extra-tokens issue could maybe be to raise the vocab size in the config to 32,005?
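To apply the padding-side and `model_max_length` suggestions, something along these lines should work (a minimal sketch; the resulting files would still need to be uploaded to the repo):

```python
# Sketch: set the suggested tokenizer settings and write them back out.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-base")
tokenizer.model_max_length = 4096   # Llama-2 context window
tokenizer.padding_side = "left"     # left padding for a decoder-only model
tokenizer.save_pretrained("snakmodel-7b-base-fixed")  # writes an updated tokenizer_config.json
```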
Hi Dan! Thanks for the interest and also the great pointers.
We further pre-trained this model using the Megatron-LLM library, and they ran into similar problems with the tokenizer: https://huggingface.co/epfl-llm/meditron-7b/discussions/5
I will discuss this with my colleagues, and we will get back to you asap.
As far as I'm concerned, we can raise the vocab size in the config to 32,005. The padding side is also "right" in the original Llama-2 tokenizer config, as is the infinite `model_max_length`: https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/tokenizer_config.json, but it makes sense to change it to 4,096.
Have you seen left padding work better?
Will keep you updated.
@jjzha Thanks for your quick reply!
Yeah, it seems exactly like the issue you linked to. I checked the size of the model's embedding layer, which is 32,000, so the tokenizer really shouldn't use any tokens with index >= 32,000. In my use case the errors happen with the EOD and PAD tokens, but I suppose these could just be set to the EOS token (aka `</s>`). Doing this, as well as removing the other unneeded extra tokens, should hopefully fix things. The tokenizer's base vocab without these is 32,000, so that seems fine.
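For the PAD token, that amounts to something like this (a sketch; it only changes the padding token on the `transformers` side, and the EOD token would simply be dropped rather than remapped):

```python
# Sketch: reuse </s> as the padding token instead of a <PAD> token that falls
# outside the model's 32,000-row embedding matrix.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-base")
tokenizer.pad_token = tokenizer.eos_token   # "</s>", id 2
print(tokenizer.pad_token, tokenizer.pad_token_id)
```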
As for the padding, Meta doesn't use padding at all when they're pretraining models, so the models haven't been trained with any padding token (which is why it's added on top). Left padding is usually used for generative models, since otherwise you end up with samples in your batch like "I really like <pad>", and if the model then generates "ice cream" the resulting document will be "I really like <pad> ice cream", which just seems quite unintuitive. For that reason, left padding tends to be the standard for generative models. But I suppose this mainly matters if you're finetuning the model.
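To make the left-vs-right point concrete, here is a small illustration (a sketch, assuming the pad token has been mapped to `</s>` as above):

```python
# Illustration: with right padding the pads end up between the prompt and any
# generated text; with left padding the prompt sits at the end of the sequence,
# so generation continues directly from it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-base")
tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"
print(tokenizer("I really like", padding="max_length", max_length=8)["input_ids"])
# prompt ids first, then pad ids - generated tokens would come after the pads

tokenizer.padding_side = "left"
print(tokenizer("I really like", padding="max_length", max_length=8)["input_ids"])
# pad ids first, prompt ids last - generated tokens follow the prompt directly
```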
Hi @saattrupdan,
We have now edited the following in both the base and instruct versions:
- Removed the unnecessary special tokens (e.g., CLS, SEP, etc.) in `special_tokens_map.json` and `tokenizer_config.json`;
- Set `model_max_length` to 4096;
- Set the padding side to `left`;
- In the instruct version, added the chat template (i.e., `[INST] [/INST]`); see the sketch below.
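The template can be checked like this (a quick sketch; the instruct repo id is assumed here, and the exact output depends on the template that was added):

```python
# Sketch: render the chat template without tokenizing to inspect the [INST] format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-instruct")  # assumed repo id
messages = [{"role": "user", "content": "Hvad hedder hovedstaden i Danmark?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False))
# Expected to look roughly like: "<s>[INST] Hvad hedder hovedstaden i Danmark? [/INST]"
```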
Let us know if this all works.
Cheers!
Hi @jjzha. You still have the `added_tokens.json` file, which defines the tokens with index >= 32,000; that file needs to be removed.
A good sanity check that you can do locally is just to load the tokeniser and print it - here's how it looks for me currently:
```python
>>> from transformers import AutoTokenizer
>>> AutoTokenizer.from_pretrained('NLPnorth/snakmodel-7b-base')
LlamaTokenizerFast(name_or_path='NLPnorth/snakmodel-7b-base', vocab_size=32000, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
    0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32000: AddedToken("<CLS>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32001: AddedToken("<SEP>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32002: AddedToken("<EOD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32003: AddedToken("<MASK>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32004: AddedToken("<PAD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
```
We see that it still registers the extra tokens, probably due to the existence of the above-mentioned `added_tokens.json` file.
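Once that file is gone, a final check could be that the tokenizer's length no longer exceeds the embedding matrix (a sketch; loading the full model just for this is heavy, but it makes the mismatch explicit):

```python
# Sketch: the number of ids the tokenizer can emit should not exceed the number
# of rows in the model's input embedding matrix.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "NLPnorth/snakmodel-7b-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

n_embeddings = model.get_input_embeddings().weight.shape[0]
print(len(tokenizer), n_embeddings)
assert len(tokenizer) <= n_embeddings, "tokenizer still defines out-of-range ids"
```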
Hi Dan,
You're correct: we also needed to edit the `tokenizer.json` file. I have now done this; let me know if it helps!