https://huggingface.co/sirmyrrh/Kyllima-34B-v1

#329
by sirmyrrh - opened

I tried to do this myself and just couldn't figure it out.

Because the tokenizer seems broken (or it hits a bug in llama.cpp). I couldn't figure it out earlier today, either :)

mradermacher changed discussion status to closed

Ugh, why. I already regenerated this model once. No matter what tokenizer source I use, the result seems to be broken. Thanks for trying! :) I'll look at it again and see if I can figure it out.

Maybe @nicoboss has an idea?

I know there used to be issues with the Yi 34B tokenizer that caused problems with GGUF conversion, but I thought they were fixed in llama.cpp a while ago. Both models in this merge were GGUFed successfully on their own, and I used one of them as the tokenizer source, so it seems like it should work. There are obviously subtleties here that I don't understand. ;P
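
If it helps with debugging, here is a rough sanity check (just a sketch using transformers; nothing is assumed beyond the repo name from this discussion and that the tokenizer loads at all) comparing what the merged repo's tokenizer and config report:

# Sketch only: compare the sizes the merged repo's tokenizer and config report.
from transformers import AutoConfig, AutoTokenizer

repo = "sirmyrrh/Kyllima-34B-v1"

tokenizer = AutoTokenizer.from_pretrained(repo)
config = AutoConfig.from_pretrained(repo)

print("len(tokenizer)       :", len(tokenizer))        # base vocab + added tokens
print("tokenizer.vocab_size :", tokenizer.vocab_size)  # base vocab only
print("config.vocab_size    :", config.vocab_size)
print("added tokens         :", tokenizer.get_added_vocab())

A mismatch there (more tokenizer entries than config.vocab_size) would line up with the "id is out of range" warnings in the conversion log below.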

For reference, this should be the relevant output:

INFO:hf-to-gguf:Set model tokenizer
WARNING:hf-to-gguf:ignore token 64001: id is out of range, max=63999
WARNING:hf-to-gguf:ignore token 64000: id is out of range, max=63999
WARNING:hf-to-gguf:replacing token 1: '<|startoftext|>' -> '<s>'
WARNING:hf-to-gguf:replacing token 2: '<|endoftext|>' -> '</s>'
Traceback (most recent call last):
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 4359, in <module>
    main()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 4353, in main
    model_instance.write()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 426, in write
    self.prepare_metadata(vocab_only=False)
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 419, in prepare_metadata
    self.set_vocab()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 1507, in set_vocab
    self._set_vocab_sentencepiece()
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 730, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/cvs/llama.cpp/convert_hf_to_gguf.py", line 799, in _create_vocab_sentencepiece
    if toktypes[token_id] != SentencePieceTokenTypes.UNUSED:
       ~~~~~~~~^^^^^^^^^^
IndexError: list index out of range
job finished, status 1
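
For what it's worth, the IndexError means the converter is indexing a token ID past the end of the token-type table it built, right after warning that tokens 64000 and 64001 are out of range. A rough way to see the mismatch locally (my sketch, assuming the usual sentencepiece layout of tokenizer.model plus config.json, and added_tokens.json if the repo ships one; the local path is a placeholder):

import json
from pathlib import Path
from sentencepiece import SentencePieceProcessor

model_dir = Path("/path/to/Kyllima-34B-v1")  # placeholder local checkout

# Size of the raw sentencepiece vocab.
sp = SentencePieceProcessor()
sp.LoadFromFile(str(model_dir / "tokenizer.model"))
sp_size = sp.vocab_size()

# Size the model config claims.
cfg = json.loads((model_dir / "config.json").read_text())
cfg_size = cfg.get("vocab_size", sp_size)

print(f"tokenizer.model vocab size: {sp_size}")
print(f"config.json vocab_size:     {cfg_size}")

# Added tokens with IDs at or above the vocab size are the ones the converter
# warns about ("id is out of range"); newer repos may keep these in
# tokenizer_config.json's added_tokens_decoder instead of added_tokens.json.
added_path = model_dir / "added_tokens.json"
if added_path.exists():
    added = json.loads(added_path.read_text())
    for tok, tok_id in sorted(added.items(), key=lambda kv: kv[1]):
        status = "out of range" if tok_id >= cfg_size else "ok"
        print(f"added token {tok_id:>6} {tok!r}: {status}")

If the sentencepiece vocab, config.json, and the added tokens disagree on the vocab size, that is the kind of mismatch that trips the sentencepiece path in the converter.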

I figured it out and got it to convert to GGUF finally. Thanks!

Well, you didn't ask for it, but I'll make imatrix ones nevertheless :)

Thank you. It's much appreciated!
