possible issue with tokenizer

#2
by robbiemu - opened

I've been using llama.cpp to quantize these models (the 2B variants so far) with a robust dataset (OSCAR) for the imatrix, and now I am on the 7B. For the first time, I am getting these errors (to be precise, I searched for the last one):

```
llm_load_vocab: control token:     99 '<|reserved_token_94|>' is not marked as EOG
...
llm_load_vocab: control token:     34 '<|reserved_token_29|>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
```

These are essentially warnings, and looking around I see that this last error sometimes appears for other public models, so maybe this is all okay. But I wanted to raise the issue in case it matters.
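
For anyone who wants to double-check, the EOS ids that the last warning refers to can be cross-checked against the repo's config files. A minimal sketch, assuming the standard HF repo layout (the path is a placeholder):

```python
# Cross-check the EOS ids mentioned by the llama.cpp warning, assuming
# the standard HF repo layout (config.json / tokenizer_config.json).
import json
from pathlib import Path

model_dir = Path("path/to/model")  # placeholder: local model checkout

cfg = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
tok_cfg = json.loads((model_dir / "tokenizer_config.json").read_text(encoding="utf-8"))

print("config.json eos_token_id:       ", cfg.get("eos_token_id"))
print("tokenizer_config.json eos_token:", tok_cfg.get("eos_token"))
```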

I am having issues generating the imatrix. I am pretty well convinced at this point that these warnings were introduced by a recent change to llama.cpp: https://github.com/ggerganov/llama.cpp/issues/9899

I am still having issues; it keeps crashing while generating the importance matrix.
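
As a sanity check, something like the following can rule out non-finite values already present in the source checkpoint before blaming the imatrix step. This is a rough sketch; the shard filename is a placeholder:

```python
# Rough sanity check, not part of the conversion pipeline: scan the HF
# safetensors shards for non-finite weights.
import torch
from safetensors.torch import load_file

# placeholder shard name - in practice, loop over every *.safetensors file
state = load_file("model-00001-of-00004.safetensors")
for name, tensor in state.items():
    if not torch.isfinite(tensor).all():
        print(f"non-finite values in {name}")
```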

I noticed that llama.cpp's default handling for SentencePiece/BPE-type tokenizers was not using tokenizer.json, just tokenizer.model. I modified the convert_hf_to_gguf.py script to let it extend the vocab from tokenizer.json, but I still get the same "nan detected in blk.21.attn_output.weight" when generating the imatrix.
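
Roughly, the idea of the modification was along these lines. This is an illustrative sketch, not the actual convert_hf_to_gguf.py internals or my exact patch:

```python
# Illustrative sketch: build the base vocab from tokenizer.model, then
# overlay the added tokens recorded in tokenizer.json. Not the real
# convert_hf_to_gguf.py code.
import json
from pathlib import Path

from sentencepiece import SentencePieceProcessor

model_dir = Path("path/to/model")  # placeholder: local model checkout

sp = SentencePieceProcessor()
sp.LoadFromFile(str(model_dir / "tokenizer.model"))

tokens = [sp.IdToPiece(i) for i in range(sp.vocab_size())]

tokenizer_json = json.loads((model_dir / "tokenizer.json").read_text(encoding="utf-8"))
for added in tokenizer_json.get("added_tokens", []):
    token_id, piece = added["id"], added["content"]
    if token_id >= len(tokens):
        # pad with placeholders up to the new id before assigning
        tokens.extend(f"<pad_{i}>" for i in range(len(tokens), token_id + 1))
    tokens[token_id] = piece

print(f"vocab size after merging added tokens: {len(tokens)}")
```

Either way, extending the vocab this way did not change the nan during imatrix generation.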

I see that two weeks ago in the 7B-instruct discussion someone had issues (presumably without generating an imatrix) and it was mentioned that GGUF models were incoming. Are you guys working with llama.cpp on the conversion? Can you shed any light on my issue?
