@bartowski on Hugging Face: "Decided to try to check how many weights in a 70b F32 model would be squashed…"

bartowski

posted an update Sep 5

Post

16100

Decided to try to check how many weights in a 70b F32 model would be squashed when converted to F16 (spoiler, it's shockingly few)

The reason for this comparison is that it should represent the same percentage of squishing as bf16 to fp16

Had claude make me a script, using the new Reflection-70B, and these are the results:

Total weights: 70553706496
Fully representable: 70530215524
Squashed: 23490972
Percentage squashed: 0.03%

0.03%!!!!

A couple things to note, this uses a roundtrip of F32 -> F16 -> F32 and then torch.isclose to account for rounding errors that come up by the very nature of extremely accurate numbers, but it uses VERY small tolerances (rtol=1e-5, atol=1e-8)

This is also examining EVERY weight that was stored at F32, and for most layers I was somewhere between 0% and 0.03% of weights being squashed, no major outliers.

Overall, I feel even safer converting to F16 for llama.cpp, the extremely small number of weights that fall outside the range are likely so small that they don't actually play a role in the final output of the model at inference anyways.

vertis

Sep 5

Forgive my ignorance, but isn't this just an extremely simple quantisation (or at least one step of it)?

bartowski

Sep 5

I suppose I should add, that this is more valuable as a pseudo comparison to bf16

Since bf16 can represent the range (1, -1) with more precision than fp16, there is much debate as to whether it's safe to convert from bf16 to fp16, or if you should keep bf16, or even upcast to fp32, in order to preserve the original quality of the model for as long as possible before quantizing to 8 bits

This test shows that fp16 is capable of represent 99.97% of the weights in an FP32 model precisely, and therefore represents a negligible at best difference

Additionally, since the weights it can't represent are between 6e-5 and -6e-5, the weights it can't represent are so small that they most likely do not contribute to the finally output of the model and are relatively safe to prune

Hampetiudo

Sep 11

Why not convert to a bf16 gguf? You can have your cake and eat it too.

bartowski

Sep 11

Bf16 can't be offloaded to GPUs so imatrix becomes slow to make :')

Join the conversation