fix vocab size
Have you tested this? The model's weights have a 32128 embedding dim, so I feel like this would break, no?
No, I didn't test this, and according to the docs you could be right; see here.
Does it work with vLLM for you? See also the example config.json from OpenOrca for comparison. Probably related to resize_token_embeddings_to_32x (but why is it not 32032 then?).
And it seems to be an issue elsewhere too, e.g. here: https://github.com/huggingface/transformers/issues/4875
I have no idea what the right solution is, or whether this is more a bug in vLLM; probably it would work to resize the token embeddings again after training (model.resize_token_embeddings(embeddings_len)) so that the usable vocab size and the embeddings match?
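For reference, a minimal sketch of that resize-back idea, assuming a standard transformers checkpoint where the tokenizer holds the real vocab size (paths below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; substitute the actual checkpoint.
model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/checkpoint")

# Shrink the padded 32128-row embedding matrix back down to the
# tokenizer's real vocab size (32000); the extra rows are dropped and
# config.vocab_size should be updated to match by the call.
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained("path/to/checkpoint-resized")
tokenizer.save_pretrained("path/to/checkpoint-resized")
```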
Feel free to close; I just wanted to make you aware of this issue :).
I think the real solution is to (1) raise an issue with vLLM and hope they fix it, or (2) add dummy tokens to the tokenizer. I resized the embeddings to a multiple of 128 since that is apparently the most efficient on H100+ GPUs. Your idea of resizing back down might also be a good and easy solution; I don't think the speed loss should be too great.
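For completeness, a rough sketch of option (2), padding the tokenizer with throwaway tokens until it matches the 32128-row embedding matrix (the token names and paths here are made up; this just assumes the usual transformers/tokenizers API):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("path/to/checkpoint")

# Number of rows in the (padded) input embedding matrix, e.g. 32128.
embedding_rows = model.get_input_embeddings().weight.shape[0]

# Add dummy tokens so len(tokenizer) == embedding_rows; the names are
# arbitrary, they just need to be unique and never appear in real text.
missing = embedding_rows - len(tokenizer)
if missing > 0:
    tokenizer.add_tokens([f"<unused_{i}>" for i in range(missing)])

tokenizer.save_pretrained("path/to/checkpoint-padded")
```

If I recall correctly, newer transformers releases also expose a pad_to_multiple_of argument, i.e. model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128), which gets the multiple-of-128 padding directly without touching the tokenizer.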
I am trying to convert the model to GGUF, and llama.cpp complains about a vocab size mismatch (model has 32128, but tokenizer.model has 32000).
(I removed everything from added_tokens.json.) I can of course "fix" the vocab_size in the config, but that eventually leads to an error when loading the model: 'token_embd.weight' has wrong shape; expected 4096, 32000, got 4096, 32128.
Any ideas?