ChaoticSoliloquy-4x8B fixed bpe

#37
by xxx777xxxASD - opened

Hi, thank you for your quants!

Could you update them with fixed bpe please? Many thanks

Hi!

Just so I understand - the repo changed after I made my quants? (I thought my quants were newer than your uploads, but hf is a bit stingy with exact dates.)

Did anything other than the vocabulary change? Or did I somehow not use the bpe vocab?

The llama3 tokenizer issue was finally solved and merged into llama.cpp

Here's the post

There is a comment that explains how to update the already existing quants, so it won't require quantizing everything again

Ah, thanks for pointing me at that. Now I know what broke command-r+ :)

Unfortunately, I am not prepared for a mass rebuild at the moment, or for patching a lot of existing ggufs (I am still adding missing quants for models I made months ago), so I will add the override switch that's needed to the model card for the time being.
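
To make that concrete, a minimal sketch of what the override looks like at inference time, assuming your llama.cpp build's main accepts --override-kv (the gguf filename here is just a placeholder):

# force the llama-3 pretokenizer on an already-downloaded, unfixed gguf
main -m ChaoticSoliloquy-4x8B.Q5_K_M.gguf --override-kv tokenizer.ggml.pre=str:llama3 -p "Hello"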

I decided to delete the imatrix repo (llama.cpp is too broken, and I don't trust it anymore when it crashes on some quants). For various other reasons, I decided to redo the static quants from scratch.

Well, they're working. Currently I'm using your imat Q5_K_M quant and it has shown no problems.

llama.cpp has no support for it atm., so it will have to wait until somebody fixes that. sigh.

NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

The problem is that they were created using the (now) wrong converter (convert.py), which will happily convert a llama-3 model, but with greatly reduced quality, while convert-hf-to-gguf.py has no support for the tokenizer this model uses.
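
To make the two paths concrete, this is roughly how the two converters are invoked (paths and the --outfile name are placeholders); the first refuses this model with the error above, the second goes through regardless:

# supported path for llama-3 models, aborts with NotImplementedError on this model
python convert-hf-to-gguf.py ./ChaoticSoliloquy-4x8B --outfile model-f16.gguf
# legacy path, converts without complaint but with the quality caveats described below
python convert.py ./ChaoticSoliloquy-4x8B --vocab-type bpe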

It's a disaster at the moment. Nobody quite knows what to convert with which switches, and everything produces output that doesn't quite work right.

The other question is why the model does not match the llama-3 tokenizer - why would it be different? Your ggufs are set to gpt2.

I could offer to convert it again with convert.py. In your testing, did you set the pretokenizer to llama-3? I could force this into the generated ggufs. But if the model actually uses the gpt2 pretokenizer, that would be wrong.

Well, if everything is bad then let's wait for an imatrix fix first, thank you for trying anyway. Would you reupload the static quants?

There won't be imatrix fixes anytime soon without drastic changes in the static quants, or maybe ever - I was told by multiple llama.cpp developers that only base models matter to them, merges are useless, and imatrix is a gimmick that they don't care about working anyway.

The issue affects both static and imatrix quants - llama 3 uses a pretokenizer requiring a regex which the C++ regex lib can't handle well, and they (apparently) do not want to link against another library, so they try to detect which pretokenizer is needed. That detection fails for this model with convert-hf-to-gguf, the supported way to convert llama 3 models, because the tokenizer can not be easily identified.

The convert.py script often silently succeeds and either generates pure garbage or, in the case of llama 3, something that superficially works because it tokenizes words slightly wrongly, i.e. as if you had written them wrongly, reducing quality (made-up example: instead of " open" it would see " o" and "pen" with no space between). LLMs can compensate for that, which is why it works, but it increases the work for them. At least that is my understanding of the problem.
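
If you want to see the effect directly, llama.cpp ships a small tokenize example; comparing the token pieces it prints for the same prompt with a correctly tagged gguf vs. a mis-tagged one makes the difference visible. A sketch (file names are placeholders, and exact usage may differ between builds):

# prints one token id and piece per line for the given prompt
tokenize model-llama3-pre.gguf " open source"
tokenize model-gpt2-pre.gguf " open source"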

What I can do is use convert.py (as I did before when convert.py was the correct tool to convert llama models) and manually force the pretokenizer setting to llama-3. That probably works, but it is not the "correct" way to convert llama-3 models.

But for that, we would need to know what tokenizer/pretokenizer is used - the llama 3 one is supported by convert-hf-to-gguf, but yours is different. The way it works is that it generates a checksum and uses that to select the internal llama.cpp pretokenizer algorithm, and this model's checksum does not match anything. One can add more, but for that we'd need to know what the model really uses.
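
One way to see which checksum the model produces is simply to run the converter and read the hash it prints right before giving up - as far as I can tell it hashes the token ids of a fixed test string and compares that against its built-in list. A sketch (the path is a placeholder):

# the failed run logs something like "chkhsh: <hex>" before raising NotImplementedError
python convert-hf-to-gguf.py ./ChaoticSoliloquy-4x8B 2>&1 | grep -i chkhsh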

There are a number of other llama-3 based models which generate unrecognized checksums, so something is clearly rotten. But somebody with much more knowledge than me would have to investigate that.

So, options we have (I would really like to get this model converted, not least because it's just the tip of the iceberg):

  1. I can make static quants with convert.py. Reduced quality if the model needs the llama-3 pretokenizer.
  2. Like 1, but with the pretokenizer manually forced to llama-3, because we know (somehow) that it is correct. Requires changes in my pipeline, but I have a good idea how to do it.
  3. Wait for a magic convert-hf-to-gguf extension/fix, or do it ourselves.
  4. Identify what causes the tokenizer mismatch and fix it in the model, or maybe brutally clone the config/tokenizer from the base model (a sketch of this follows right after the list).
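
For option 4, the brute-force variant would just be copying the tokenizer files from the base model's HF repo over the merge's files before converting. A sketch, assuming the base is a stock Meta-Llama-3-8B-Instruct checkout and the usual HF tokenizer file names:

# bash: overwrite the merge's tokenizer with the base model's known-good files, then reconvert
cp Meta-Llama-3-8B-Instruct/{tokenizer.json,tokenizer_config.json,special_tokens_map.json} ChaoticSoliloquy-4x8B/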

I can then even retry imatrix generation with a smaller training set, which probably reduces quality (usually not by much), but increases the chances of llama.cpp not crashing.
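
For reference, imatrix generation is its own llama.cpp tool, and the knob I mean is simply feeding it a smaller calibration text. Roughly (file names are placeholders):

# compute an importance matrix from a (smaller) calibration text
imatrix -m ChaoticSoliloquy-4x8B.f16.gguf -f calibration-small.txt -o ChaoticSoliloquy-4x8B.imatrix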

The good part on the crashing front is that at least llama.cpp now checks for the presence of NaN values when quantizing, so it crashes with a slightly clearer error message, but probably in turn also more often. A lot of models that clearly are not broken and are well-used (e.g. goliath longLORA) do not work with current llama.cpp versions because the developers think they are too broken to be used.

It might even be that the imatrix generation works better because the tokenization is correct (a good chance, in fact, because it is such a big change).

It's a lot to take in, and your model is far from the only one affected (but most llama-3 models do convert, so something must have happened to the tokenizer).

I think the method of exactly matching the tokenizer that convert-hf-to-gguf uses is a disaster, as subtle changes when tuning/merging models are all too common, but it's what we've got.

Oh, then let's try the second option.

Btw thank you so much for your effort.

So, the generation with the forced pretokenizer worked, and the model still generates NaNs during imatrix generation. Since you said it works for you, I'll keep the imatrix quants up (or rather, try to generate what I can).

Give it a try, and see if the pretokenizer makes any difference.

mradermacher changed discussion status to closed

Uh, it seems imatrix generation crashes with all quants. That's a new thing. Sigh.

It's okay, I'm very grateful even for static quants :D

My internet speed is pretty terrible, so downloading all of them myself would be the death of me.

By the way, how did you make them? How can I manually change the pretokenizer while quantizing?

I used the equivalent of

quantize --override-kv tokenizer.ggml.pre=str:llama3

I think llama-bpe is the preferred string (not llama3), but I use llama3 as a marker for "forcibly overridden". Both strings are treated the same way, afaics.

The correct way (apparently) is to use convert-hf-to-gguf, but that one needs a manual entry for each tokenizer type (in convert-hf-to-gguf-update.py or so, and then a patch). Right now, most llama 3 8b-based models fail because they don't have their tokenizer hash in there.
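
For completeness, that update script is meant to be run against the registered upstream tokenizer repos and regenerates the hash table in get_vocab_base_pre(); adding a new model means adding its repo to the script's model list first. Roughly:

# downloads the registered tokenizers and rewrites get_vocab_base_pre() in convert-hf-to-gguf.py
python convert-hf-to-gguf-update.py <hf_read_token>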

First

convert.py TestModelFolder --outtype f16 --vocab-type bpe

and then

quantize.exe --override-kv tokenizer.ggml.pre=str:llama3 TestModel_f16.gguf Q4_K_M ?

Yes, although I wouldn't quantize twice (--outtype f16 is either a no-op or quantizes, I'd just leave it out - current versions of llama.cpp do the right thing). Other than that, that's almost exactly what I did.

except, I think, quantize requires the output filename:

quantize.exe --override-kv tokenizer.ggml.pre=str:llama3 TestModel_f16.gguf TestModel_Q4_K_M.gguf Q4_K_M

That's strange, because it never works for me when I try to quantize an HF model without converting it to f16 first.

quantize.exe --override-kv tokenizer.ggml.pre=str:llama3 TestModelFolder Q4_K_M didn't work for me

Well, what happens when you try (convert)? Since the model is in bf16 format, quantizing to f16 is likely a quality loss.

And, as I wrote in my previous entry, your quantize syntax is wrong, try the one I gave in my last post. If that doesn't work, what happens?

v1.5 looks much better :)

Thanks :>

Well, it seems that v1.5 is less stable than 1.0, so I'll try some different combination, or merge lumimaid with something else

wait, the tokenizer issue was fixed?

> Well, what happens when you try (convert)? Since the model is in bf16 format, quantizing to f16 is likely a quality loss.
>
> And, as I wrote in my previous entry, your quantize syntax is wrong, try the one I gave in my last post. If that doesn't work, what happens?

Am I still missing something?

quantize --override-kv tokenizer.ggml.pre=str:llama3 TEST-4x8B TEST-4x8B-Q5_K_M.gguf Q5_K_M

Output:

main: build = 2789 (84250014)
main: built with MSVC 19.38.33135.0 for x64
main: quantizing 'TEST-4x8B' to 'TEST-4x8B-Q5_K_M.gguf' as Q5_K_M
llama_model_quantize: failed to quantize: llama_model_loader: failed to load model from TEST-4x8B

main: failed to quantize model from 'TEST-4x8B'

You have to specify the source gguf file. If you haven't specified an output file in the convert/convert-hf-to-gguf phase, it will be called "ggml-model-f16.gguf" in the source directory, I think. Just to be clear, you have to run convert or convert-hf-to-gguf first. If you are confused because of the f16 thing: the problem wasn't the use of convert, but the --outtype flag. If you don't specify --outtype, then the script will pick the appropriate tensor type when converting (f32 for bf16 sources, f16 otherwise).
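
Putting that together, the flow I'd suggest on your side looks roughly like this (the intermediate file name is whatever convert actually writes into the source directory, so check what it produced):

# 1. convert without --outtype so a bf16 source ends up as f32
python convert.py TEST-4x8B --vocab-type bpe
# 2. quantize the gguf from step 1, forcing the llama-3 pretokenizer
quantize --override-kv tokenizer.ggml.pre=str:llama3 TEST-4x8B/ggml-model-f32.gguf TEST-4x8B-Q5_K_M.gguf Q5_K_M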

As for the tokenizer, I have patched my copy of convert-hf-to-gguf. It should be listed in the quant kvs, but I just saw... my script must be buggy. That's why I can't even tell whether the v1.5 model needed it or not (it's autodetected here).

I don't know if there is a real difference between using either convert or convert-hf-to-gguf at the moment, but you can always try convert-hf-to-gguf.py before convert.py - the former will complain if it can't handle it.
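
In shell terms, that check can be as simple as falling back only when the stricter converter refuses. A sketch (paths are placeholders):

# try the supported converter first; fall back to convert.py only if it bails out
python convert-hf-to-gguf.py ./TEST-4x8B || python convert.py ./TEST-4x8B --vocab-type bpe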

Just a heads up: apparently a number of jobs were recently submitted with the wrong converter - long story short, the v1.5 quants were broken. I am regenerating them right now.
