bartowski (Bartowski)

replied to their post about 1 month ago

The TEST mark was added after the initial upload, once people pointed it out :) glad it is a good label though

posted an update about 1 month ago

Regarding the latest Mistral model and the GGUFs for it:

Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window

Yes, I got excited when a conversion worked and released them ASAP

That said, generation seems to work right now and seems to mimic the output from spaces that are running the original model

I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect. If people still feel misled and think it's not the right thing to do, please post your thoughts (civilly) below. I will strongly consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all!

6 replies

replied to their post about 2 months ago

This argument really doesn't make any sense to me. Surely, if you're aiming for the most accurate overall representation, anyone can see that gathering as many data points as possible across a diverse area would yield the most useful results? Sure, ideally your single light will probably get a reasonably close overall value, but it also might not.

Additionally, I think his point was that you don't necessarily want to increase performance against a given corpus, but rather increase faithfulness to the original model against a given corpus

You may be able to keep PPL the same as or better than the original while simultaneously veering far from what the original model would have generated. That's great for that corpus of text, but it's not the intention of quantization itself. (In fact, many people worry about this a lot, fearing that the quantization will favour the text used as a reference, which I'm luckily seeing is not what happens, at least for imatrix.)

The fact that 2 models can have identical PPL scores yet generate completely different text should be proof enough that PPL only tells a tiny part of a story. Yes it's good to know the model is good, but when quantizing I don't need to know how good it is, I need to know how similar it is to the original.
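To make that concrete, here is a toy sketch in plain Python. The distributions are made up for illustration (a hypothetical 3-token vocabulary, two positions) and the function names are not from llama.cpp. Both "models" assign the same probability to the reference tokens, so their PPL is identical, yet the KL divergence is clearly non-zero and they only agree on the top token half the time.

```python
import math

def perplexity(dists, targets):
    """exp of the mean negative log-probability given to the reference tokens."""
    nll = [-math.log(d[t]) for d, t in zip(dists, targets)]
    return math.exp(sum(nll) / len(nll))

def mean_kld(base, quant):
    """Mean KL(base || quant) over positions, using the full distributions."""
    per_pos = [
        sum(p * math.log(p / q) for p, q in zip(b, qd) if p > 0)
        for b, qd in zip(base, quant)
    ]
    return sum(per_pos) / len(per_pos)

def top_token_agreement(base, quant):
    """Fraction of positions where both models pick the same most likely token."""
    same = sum(b.index(max(b)) == q.index(max(q)) for b, q in zip(base, quant))
    return same / len(base)

# Made-up data: reference token ids and per-position token probabilities.
targets = [0, 2]
base = [[0.5, 0.4, 0.1], [0.6, 0.1, 0.3]]
quant = [[0.5, 0.1, 0.4], [0.1, 0.6, 0.3]]

print("PPL base :", perplexity(base, targets))
print("PPL quant:", perplexity(quant, targets))
print("Mean KLD :", mean_kld(base, quant))
print("Same top :", top_token_agreement(base, quant))
```

PPL only looks at the probability of the reference token, while KLD and top-token agreement compare the whole distribution against the original model, which is why they say more about faithfulness.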

replied to their post about 2 months ago

I suppose that's reasonable. I guess the reason I like KLD more is that it breaks things down into several statistics (mean, max, 99.99%, etc.), whereas PPL is just a single all-encompassing number that's more difficult to interpret.

I don't know if I can put much value into IQ6 outperforming fp16, because lately we've been seeing benchmarks where Q3 beats bf16. So while they're useful, I don't know that they can definitively tell us quant quality, but I do think it's a good proof of competency.

This is why KLD, to me, provides at least a slightly clearer image of how well the quantization does at recreating the original model. I still see what you're saying about PPL, but (at least how llama.cpp does it) KLD gives a more thorough look. That, and top p is nice for seeing how often the models agree on the token.

replied to their post about 2 months ago

That's not an invalid point, but also when the final goal is quantization that 0.03% is negligible compared to the rest of the losses.

If you're talking about running at full precision, yeah, bf16 > fp16 by all means

I'd also prefer to see KLD of fp16 vs bf16 since PPL is, to me, pretty meaningless. I'm sure it has value and probably more than I give it, but unless it's PPL against the dataset it was trained on I don't really find much merit to it.

I appreciate the breakdown though, and even 0.4% is not enough to worry me when again the final goal is quantization, not to run it at that DTYPE.

To that end, do you happen to know whether, when quantizing from BF16, it gets converted to FP16 first? Does it even matter? BF16 -> Q8 vs BF16 -> FP16 -> Q8, I wonder how different it would be. Gut instinct says it's in the 0.01% range.

replied to their post 2 months ago

Bf16 can't be offloaded to GPUs so imatrix becomes slow to make :')

posted an update 2 months ago

Reposting from twitter:

Just so you all know, I'll be on vacation for the following two weeks and away from home! I'm hoping to get on at least once a day to load up some quants, but I won't be as bleeding edge and on the ball :) feel free to shoot me a message if you see one I should make!

In the meantime if you need something bleeding edge make sure to check out @MaziyarPanahi or @bullerwins who both put out great work!

4 replies

replied to their post 2 months ago

I suppose I should add that this is more valuable as a pseudo comparison to bf16

Since bf16 can represent much smaller magnitudes near zero than fp16 (it trades mantissa bits for exponent range), there is much debate as to whether it's safe to convert from bf16 to fp16, or if you should keep bf16, or even upcast to fp32, in order to preserve the original quality of the model for as long as possible before quantizing to 8 bits

This test shows that fp16 is capable of representing 99.97% of the weights in an FP32 model precisely, so the conversion makes a negligible difference at most

Additionally, since the weights it can't represent are between 6e-5 and -6e-5, they are so small that they most likely do not contribute to the final output of the model and are relatively safe to prune
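For anyone wondering where the 6e-5 figure comes from, here is a small sketch that derives the limits of each format from its bit layout (fp16: 5 exponent bits and 10 mantissa bits, bf16: 8 exponent bits and 7 mantissa bits). The function name is just for illustration.

```python
def format_limits(exp_bits, man_bits):
    """Limits of an IEEE-style binary float with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        # Smallest positive normal value; below this precision degrades.
        "min_normal": 2.0 ** (1 - bias),
        # Largest finite value.
        "max_finite": (2 - 2.0 ** -man_bits) * 2.0 ** bias,
        # Spacing between 1.0 and the next representable value.
        "epsilon": 2.0 ** -man_bits,
    }

fp16 = format_limits(exp_bits=5, man_bits=10)
bf16 = format_limits(exp_bits=8, man_bits=7)

for name, limits in (("fp16", fp16), ("bf16", bf16)):
    print(name, limits)
```

The fp16 `min_normal` comes out at 2^-14, roughly 6.1e-5, which is the threshold mentioned above. bf16 reaches far smaller magnitudes, but its `epsilon` is 8x coarser than fp16's.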

posted an update 2 months ago

Decided to try to check how many weights in a 70b F32 model would be squashed when converted to F16 (spoiler, it's shockingly few)

The reason for this comparison is that it should represent the same percentage of squishing as bf16 to fp16

Had Claude make me a script, ran it on the new Reflection-70B, and these are the results:

Total weights: 70553706496
Fully representable: 70530215524
Squashed: 23490972
Percentage squashed: 0.03%

0.03%!!!!

A couple of things to note: this uses a roundtrip of F32 -> F16 -> F32 and then torch.isclose to account for the rounding errors that come with extremely precise numbers, but it uses VERY small tolerances (rtol=1e-5, atol=1e-8)

This is also examining EVERY weight that was stored at F32, and for most layers I was seeing somewhere between 0% and 0.03% of weights being squashed, with no major outliers.
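This is not the actual script, but the idea can be sketched with nothing but the Python standard library. It assumes `struct`'s half-precision `"e"` format as a stand-in for the torch F16 cast, and applies the closeness rule that torch.isclose documents (|a - b| <= atol + rtol * |b|) by hand. The sample values and function names are made up for illustration.

```python
import struct

RTOL, ATOL = 1e-5, 1e-8  # tolerances quoted in the post

def roundtrip_f16(x: float) -> float:
    """Convert to IEEE half precision and back; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # magnitude too large for F16
        return float("inf") if x > 0 else float("-inf")

def survives_f16(x: float) -> bool:
    """True if the roundtripped value is within tolerance of the original."""
    return abs(roundtrip_f16(x) - x) <= ATOL + RTOL * abs(x)

def count_squashed(weights):
    """Return (total, squashed) for an iterable of weights."""
    weights = list(weights)
    squashed = sum(1 for w in weights if not survives_f16(w))
    return len(weights), squashed

# Hypothetical sample values, not taken from any real model:
# two ordinary weights, one tiny value in the F16 subnormal range,
# and one value that overflows F16.
sample = [1.0, 0.15625, 2**-18 + 2**-25, 1e6]
total, squashed = count_squashed(sample)
print(f"Total weights: {total}")
print(f"Squashed: {squashed}")
print(f"Percentage squashed: {100 * squashed / total:.2f}%")
```

With tolerances this tight, a value only counts as representable if it survives the roundtrip almost exactly, so anything that loses bits in the F16 subnormal range or overflows is counted as squashed.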

Overall, I feel even safer converting to F16 for llama.cpp, the extremely small number of weights that fall outside the range are likely so small that they don't actually play a role in the final output of the model at inference anyways.

20 replies

replied to their post 3 months ago

also maybe there should be a new feature to be explicitly notified about new repositories

That would be amazing, probably for average users but especially for me, where I sometimes stumble upon a model uploaded days ago that I somehow didn't notice from a creator I enjoy

We will have to see if something like that is possible without cluttering up the profile pages too much. But we'll try.

That sounds awesome, could even consider something like a toggle in the settings for "show this model on my page" or something, and possibly as a variable when using huggingface-cli or the HF python API

I think we'll be doing a social features sprint soon and this is exactly the kind of feedback we need! Thank you so much!

Beautiful, I love this :D If you need feedback on anything specific feel free to reach out, would love to be a guinea pig or just early eyes!

posted an update 3 months ago

@victor (is this the only way to "DM" on HF?)

Had a funny thought, would it be at all possible to rework what shows up on our personal HF page?

Picture this: I upload a model to an organization, someone who follows me now has no idea that I've uploaded a model or to where, unless they also watch those repos (which also floods them with other notifications)

What if our main Hugging Face page was a collection of both the models that we've uploaded specifically to our profile and the models we've uploaded to organizations? That way it would all be contained in one central followable location, and I wouldn't have concerns about losing followership if I wanted to upload to an organization all of a sudden.

3 replies

replied to victor's post 3 months ago

Oh another big pain point: notifications

I would love to be able to subscribe to be notified of new models posted by people or organizations, but it's near impossible as is

replied to victor's post 3 months ago

I would love better filtering

First I think sort by created is broken, but haven't checked on desktop recently

Second, I would love date filtering, like show me trending models that were only posted or updated in the past 7 days and such

reacted to clem's post with 🤗 3 months ago

This isn’t a goal of ours because we have plenty of money in the bank but quite excited to see that @huggingface is profitable these days, with 220 team members and most of our platform being free (like model hosting) and open-source for the community!

Especially noteworthy at a time when most AI startups wouldn’t survive a year or two without VC money. Yay!

4 replies

replied to clem's post 3 months ago

I'm happy to hear this too, money in the bank is good, but upwards momentum makes it so much easier to justify investing in new technology and improving things!

reacted to clem's post with ❤️ 3 months ago

posted an update 3 months ago

So turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp

It starts true; imatrix runs the model against a corpus of text and tracks the activation of weights to determine which are most important

However what the quantization then does with that information is where I was wrong.

I think I made an accidental connection between imatrix and ExLlamaV2's measuring, where ExLlamaV2 decides how many bits to assign to which weight depending on the goal BPW

Instead, what llama.cpp does with imatrix is attempt to select a scale for a quantization block that most accurately returns the important weights to their original values, i.e. it minimizes the dequantization error weighted by the importance of the activations

The mildly surprising part is that it actually just does a relatively brute-force search: it picks a bunch of scales, tries each one, and sees which results in the minimum error for the weights deemed important in the group

But yeah, turns out, the quantization scheme is always the same, it's just that the scaling has a bit more logic to it when you use imatrix
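As a rough illustration of that idea (this is NOT llama.cpp's actual code, just a toy sketch under simplifying assumptions), here is a brute-force scale search for one block. The function names, the candidate grid, and the plain symmetric integer rounding are all made up for the example.

```python
def weighted_error(weights, importance, scale, q):
    """Importance-weighted squared dequantization error for one block."""
    return sum(i * (w - scale * v) ** 2 for w, i, v in zip(weights, importance, q))

def search_scale(weights, importance, bits=4, steps=8):
    """Try a handful of candidate scales and keep the one with the lowest
    importance-weighted error. The quantization grid itself never changes."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return 0.0, [0] * len(weights)
    best = None
    for k in range(-steps, steps + 1):
        # Hypothetical candidate grid around the naive scale amax / qmax.
        scale = amax / (qmax + 0.1 * k)
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
        err = weighted_error(weights, importance, scale, q)
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]

# Toy block where the third weight is marked as far more important.
block = [0.9, -0.31, 0.12, 0.05]
imatrix = [1.0, 1.0, 100.0, 1.0]

scale_i, q_i = search_scale(block, imatrix)
scale_u, q_u = search_scale(block, [1.0] * len(block))
print("with importance   :", scale_i, q_i)
print("without importance:", scale_u, q_u)
```

Both runs use the same bit width and the same integer grid. The only thing the importance values change is which candidate scale wins the search.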

Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up

5 replies

replied to their post 3 months ago

much more difficult though if you're trying to iterate, definitely an interesting final validation

replied to their post 3 months ago

oh god dammit haha, i did not think of that possibility AT ALL 🤦

KL divergence is almost identical (though even then it's upsetting that it's "almost"), but yup, there are huge differences in the top p...

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.339378 ±   0.038949
Mean PPL(base)                :   6.337070 ±   0.038896
Cor(ln(PPL(Q)), ln(PPL(base))):  99.99%
Mean ln(PPL(Q)/PPL(base))     :   0.000364 ±   0.000067
Mean PPL(Q)/PPL(base)         :   1.000364 ±   0.000067
Mean PPL(Q)-PPL(base)         :   0.002308 ±   0.000427

====== KL divergence statistics ======
Mean    KLD:   0.000005 ±   0.000001
Maximum KLD:   0.113848
99.9%   KLD:   0.000346
99.0%   KLD:   0.000055
99.0%   KLD:   0.000055
Median  KLD:   0.000001
10.0%   KLD:  -0.000014
 5.0%   KLD:  -0.000021
 1.0%   KLD:  -0.000035
Minimum KLD:  -0.000120

====== Token probability statistics ======
Mean    Δp:  0.002 ± 0.000 %
Maximum Δp: 19.102%
99.9%   Δp:  0.417%
99.0%   Δp:  0.155%
95.0%   Δp:  0.067%
90.0%   Δp:  0.040%
75.0%   Δp:  0.010%
Median  Δp:  0.000%
25.0%   Δp: -0.007%
10.0%   Δp: -0.034%
 5.0%   Δp: -0.062%
 1.0%   Δp: -0.154%
 0.1%   Δp: -0.439%
Minimum Δp: -5.820%
RMS Δp    :  0.078 ± 0.016 %
Same top p: 99.927 ± 0.007 %

replied to their post 3 months ago

Either way I appreciate the insight and now question all my life decisions, especially the ones that involved me uploading fp32 files and spending 3x the time calculating imatrix on bf16 instead of fp16
