Testing experimental quants
I'm going to be testing Meta-Llama-3-8B-Instruct-f16-q4_K_S.gguf against Meta-Llama-3-8B-Instruct-q4_K_S.gguf. I'll share any findings in this thread.
excellent, I appreciate it!!
Here is a repo with some results: ddh0/UnquantizedEmbeddingTesting
There are a couple files in the repo that are not detailed in the README, but there is some information there that may be interesting. Let me know if there are any specific models or tests that you'd like done.
TL;DR: there is a measurable difference between models with unquantized vs. quantized embedding/output tensors, but how significant that difference is needs further investigation
cc @ZeroWw
Explaining Newton's laws of motion using examples and analogies
Q8_0: it has enough contextual understanding of the prompt in order to properly adhere to the instructions; it gives the definition of each law of motion, an example, and an analogy.
f16.Q2_K: it has enough contextual understanding of the prompt in order to properly adhere to the instructions; it gives the definition of each law of motion, an example, and an analogy.
Q4_K_S: it does not have enough contextual understanding of the prompt in order to properly adhere to the instructions; it only gives an example and analogy.
f16.Q4_K_S: it has enough contextual understanding of the prompt in order to properly adhere to the instructions; it gives the definition of each law of motion, an example, and an analogy.
Even in something as basic as this, where giving the definition is heavily implied, Q4_K_S fails to understand it, yet f16.Q2_K succeeds while being slightly smaller.
Create an algorithm in Python code to generate a random password between 8 and 15 characters containing lowercase letters, uppercase letters, and numbers
Q8_0: in-depth code explanation; gave a step-by-step explanation of what the code does, identified a potential shortcoming, and offered a suggestion for modifying the code.
f16.Q2_K: basic code explanation; all it did was state that the code fit the criteria.
Q4_K_S: surface level code explanation; made very obvious observations of the code such as, "random
generates random characters" and "generate_password
generates the password."
f16.Q4_K_S: in-depth code explanation; gave a step-by-step explanation of what the code does, identified a potential shortcoming, and offered a suggestion for modifying the code.
I ran the code from all 4 of them and they all did what was asked, and the code was nearly identical for all of them except Q4_K_S, which took a very different approach from the rest. The difference between f16 and non-f16 embedding and output tensors is very clear in the Q4_K_S vs. f16.Q4_K_S comparison: Q4_K_S gave an extremely obvious code explanation with no depth, while f16.Q4_K_S understood step by step what the code was doing.
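For reference, here is a minimal sketch of the kind of generator the prompt asks for. This is my own illustrative version, not any model's actual output:

```python
import random
import string

def generate_password() -> str:
    """Generate a random password of 8-15 characters containing
    lowercase letters, uppercase letters, and digits."""
    length = random.randint(8, 15)
    # Guarantee at least one character from each required class.
    required = [
        random.choice(string.ascii_lowercase),
        random.choice(string.ascii_uppercase),
        random.choice(string.digits),
    ]
    pool = string.ascii_lowercase + string.ascii_uppercase + string.digits
    rest = [random.choice(pool) for _ in range(length - len(required))]
    chars = required + rest
    random.shuffle(chars)  # avoid a predictable class-ordered prefix
    return "".join(chars)
```

(For real-world use, the `secrets` module would be preferable to `random`; this sketch just mirrors what the prompt asks for.)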
Conclusion
f16.Q2_K has just as much contextual understanding as Q8_0; Q4_K_S, on the other hand, had significantly worse contextual understanding than f16.Q2_K despite being slightly larger. I only highlighted two prompts, but most of the other side-by-side comparisons led to the same conclusion, just to varying degrees. In my own personal tests, I have seen a difference in contextual understanding between Q8_0 and f16.Q8_0, but nothing as comprehensive as ddh0's results.
At a bare minimum, Q4 and below should use f16 embedding and output tensors, or at the very least offer them as an additional option, since they do increase the file size. Some 1:1 comparisons between Q8_0 and f16.Q8_0, and between Q6_K and f16.Q6_K, would help determine whether this should be implemented across the board. I would be particularly interested in a comparison between Q6_K and f16.Q4_K_M, since they're nearly identical in file size.
In my own tests with Mistral v0.3 and WizardLM-2, f16.q5 and f16.q6 gave the best results.
You can find the quantizations in my profile.
https://huggingface.co/ZeroWw/Samantha-Qwen-2-7B-GGUF
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF
https://huggingface.co/ZeroWw/microsoft_WizardLM-2-7B-GGUF
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF
https://huggingface.co/ZeroWw/Mistroll-7B-v2.2-GGUF
Test results for f16-q6_K vs q6_K and f16-q8_0 vs q8_0 are available in the repo (still need to update the README)
My feedback for q8_0 vs. q8_1, based on a 21-question survey totaling ~4,200 tokens. Client: LM Studio, temp=0, topP=0.95, system prompt: "Perform the task to the best of your ability."
The first shot for each was basically the same; after regenerating more than 3 times, there were some differences: 1. q8_1 followed the instructions better; q8_0 stopped responding in the middle, after a summarization task. 2. The quality of the completed tasks was similar.
I suspect the q8_0 file is broken. I also downloaded and tried bartowski/tabula-8b-GGUF q8_0 and q8_0_L. I don't know what's wrong; neither works with LM Studio v0.2.25, with either the Llama3 or ChatML preset.
I completely overlooked the fact that what I looked at was a comparison of f16 vs. f16.Q4_K_S, not f16.Q4_K_S vs. Q4_K_S. Most of my previous conclusion should still be correct; however, I have to go back and redo some things. I'll update when I finish comparing all of the variants with each other, but at a quick glance, f16, f16.Q8_0, and f16.Q6_K all seem to be nearly identical and preferable over Q8_0.
EDIT: Actually, there's a mismatch between the README and the file; the README says it was f16.Q4_K_S vs. Q4_K_S, but the file says it was f16 vs. f16.Q4_K_S. @ddh0 could you clarify which of the two it was? Also, I hate to ask this since the comparisons were already run, but would you be able to do another run where each model has its own separate file for the responses? It would make the comparisons much easier, since I could just highlight the differences in a text editor.
Actually, there's a mismatch on the README vs. file; the README says that it was f16.Q4_K_S vs. Q4_K_S, but the file says it was f16 vs. f16.Q4_K_S. @ddh0 could you clarify which of the two it was?
@HiroseKoichi It's f16-q4_K_S vs. regular q4_K_S
Also, which models would you like me to compare?
Not comparisons this time; I want each model individually run on the 40 prompts so that they each have their own text file. The current output is good for automatic evaluation but very hard for manual evaluation. If I want to compare the output of one model against another, I have to copy the first half back into both responses and then into their own separate text files if I want to see them visually side-by-side. If each model's responses are in its own text file, then I can just select two files and run a diff check in a text editor to highlight all the differences.
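That workflow can also be scripted; here is a minimal sketch using Python's difflib (the file names are hypothetical, not the actual repo files):

```python
import difflib
from pathlib import Path

def diff_responses(file_a: str, file_b: str) -> str:
    """Return a unified diff of two models' response files,
    showing only the lines where their outputs differ."""
    a = Path(file_a).read_text(encoding="utf-8").splitlines()
    b = Path(file_b).read_text(encoding="utf-8").splitlines()
    return "\n".join(
        difflib.unified_diff(a, b, fromfile=file_a, tofile=file_b, lineterm="")
    )

# Hypothetical usage once each model has its own file:
# print(diff_responses("Results_q4_K_S.txt", "Results_f16-q4_K_S.txt"))
```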
Ah okay. I'll set that up
I made some more quantizations (the q4, q5 and q8 are f16/q4, f16/q5 and f16/q8):
You can find them all in the models section at https://huggingface.co/ZeroWw
P.S. I didn't do q4 because q4_k quantizations are, imho, bad in most cases, but you are free to try f16/q4... though f16/q5 is probably better.
https://huggingface.co/ZeroWw/Samantha-Qwen-2-7B-GGUF
https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-GGUF
https://huggingface.co/ZeroWw/microsoft_WizardLM-2-7B-GGUF
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF
https://huggingface.co/ZeroWw/Mistroll-7B-v2.2-GGUF
https://huggingface.co/ZeroWw/Phi-3-mini-128k-instruct-GGUF
https://huggingface.co/ZeroWw/Phi-3-medium-128k-instruct-GGUF
https://huggingface.co/ZeroWw/Qwen1.5-7B-Chat-GGUF
https://huggingface.co/ZeroWw/NeuralDaredevil-8B-abliterated-GGUF
https://huggingface.co/ZeroWw/MixTAO-7Bx2-MoE-v8.1-GGUF
https://huggingface.co/ZeroWw/aya-23-8B-GGUF
I want each model individually run on the 40 prompts so that they each have their own text file
@HiroseKoichi sorry for the delay, this is done now. Each model has its results in a separate file in the repo: ddh0/UnquantizedEmbeddingTesting
All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find
All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find
Too many, because you used random seeds.
In a comparison like this the seed should be fixed, and you should also include some questions that involve reasoning and some that involve creative writing.
That's because the output tensor affects the "way" the model expresses itself, while the embedding tensor affects its understanding more.
Also, add one test of the pure f16 (convert the HF model to f16), like:
python llama.cpp/convert-hf-to-gguf.py --outtype f16 ${model_name} --outfile ${model_name}.f16.gguf
That f16 will serve as the "baseline".
Here you can find a bunch of models with f16, f16.q5, f16.q6 and f16.q8 versions: https://huggingface.co/RobertSinclair
CC @ddh0 , @bartowski @helloAI333
Too many, because you used random seeds.
Don't think seeds are relevant in this case as I'm not doing any sampling
Too many, because you used random seeds.
Don't think seeds are relevant in this case as I'm not doing any sampling
@ddh0
In general, no... but asking the same questions yields different results depending on the seed, and it's more difficult to determine how degraded a model is if the seeds are random.
He's got temperature = 0.0, which means the seed doesn't play a role
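To illustrate the point: at temperature 0, sampling degenerates into a deterministic argmax over the logits, so the RNG (and therefore the seed) is never consulted. A toy sketch, not llama.cpp's actual sampler:

```python
import random

def greedy_pick(logits):
    """Temperature-0 'sampling' is just argmax: no random draw
    is made, so the seed cannot change the chosen token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 2.5, -1.0, 0.7]
picks = set()
for seed in (0, 42, 12345):
    random.seed(seed)               # re-seeding the RNG...
    picks.add(greedy_pick(logits))  # ...never affects the argmax
# every seed picks the same token index
```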
@ddh0 I created a pull request to fix the formatting of the files. The current ones have the escape sequences written in plain text instead of rendered.
Can you also drop an additional text file that has the file sizes of all the models? Thanks again for running all of this.
Results for pure bf16 test are up: Results_Meta-Llama-3-8B-Instruct-bf16.gguf.txt
I created a pull request to fix the formatting of the files. The current ones have the escape sequences written in plain text instead of rendered.
Thank you, but this is intentional and I don't think it's a problem
Can you also drop an additional text file that has the file sizes of all the models? Thanks again for running all of this.
Will do now
Here is a text file with the sizes of each model in bytes (as output by ls -al on my machine): sizes.txt
Weird... in your "sizes" file I read:
7835472160 Jun 16 18:30 Meta-Llama-3-8B-Instruct-f16-q6_K.gguf
while my quantization shows:
7.84 GB
Can you check if the file is the same?
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.q6_k.gguf
I ask because I am not sure what makes my quantization better... it might be anything.
I would suggest you run tests comparing the f16 in my repository against the q5, q6 and q8 in the same directory.
Those are definitely the right files.
To obtain them I run a Colab notebook whose main part is this:
import os
import subprocess
repo_model_name = 'gradientai/Llama-3-8B-Instruct-Gradient-1048k' #@param ["mistralai/Mistral-7B-Instruct-v0.3", "lucyknada/microsoft_WizardLM-2-7B", "meta-llama/Meta-Llama-3-8B-Instruct", "BarraHome/Mistroll-7B-v2.2","Qwen/Qwen1.5-7B-Chat","microsoft/Phi-3-mini-128k-instruct","microsoft/Phi-3-medium-128k-instruct","google/gemma-7b",'zhengr/MixTAO-7Bx2-MoE-v8.1','CohereForAI/aya-23-8B','01-ai/Yi-1.5-9B-32K','deepseek-ai/DeepSeek-Coder-V2-Lite-Base','01-ai/Yi-1.5-6B-Chat','ZeusLabs/L3-Aethora-15B-V2','Nitral-AI/Hathor_Stable-v0.2-L3-8B'] {allow-input: true}
model_name = os.path.basename(repo_model_name)
# Download Model
print(f'Downloading {repo_model_name}')
subprocess.run(['huggingface-cli', 'download', repo_model_name, '--local-dir', model_name], stdout=subprocess.DEVNULL)
# Convert Model
print('Converting model to f16.')
subprocess.run(['python', 'llama.cpp/convert-hf-to-gguf.py', '--outtype', 'f16', model_name, '--outfile', f'{model_name}.f16.gguf'], stdout=subprocess.DEVNULL)
# Remove the original model directory
os.system(f'rm -rf {model_name}')
# Quantize Model
quantization_types = ['q5_k', 'q6_k', 'q8_0']
for q_type in quantization_types:
    print(f'Quantizing {q_type}')
    subprocess.run(['./build/bin/llama-quantize', '--allow-requantize', '--output-tensor-type', 'f16', '--token-embedding-type', 'f16', f'{model_name}.f16.gguf', f'{model_name}.{q_type}.gguf', q_type, str(os.cpu_count())], stdout=subprocess.DEVNULL)
7835472160 bytes is equal to 7.835 GB (decimal), which rounds up to 7.84 GB
7835472160 bytes is equal to 7.835 GB (decimal), which rounds up to 7.84 GB
7835472160 / 1024 / 1024 / 1024 = 7.29 GiB (binary units)
No, I do not confirm that. If you want to confirm that on your own, go ahead
Edit: I don't think that the exact file size in bytes is going to help you figure anything out, for what it's worth
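For what it's worth, the two figures are consistent: they are the same byte count expressed in decimal gigabytes (what Hugging Face displays) vs. binary gibibytes. A quick check:

```python
size_bytes = 7_835_472_160  # the q6_K file size from sizes.txt

gb = size_bytes / 1_000_000_000  # decimal GB (Hugging Face style)
gib = size_bytes / (1024 ** 3)   # binary GiB (ls -h style)

print(f"{gb:.2f} GB")    # 7.84 GB
print(f"{gib:.2f} GiB")  # 7.30 GiB
```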
This is how the sizes should be:
-rw-r--r-- 1 root root 16068890912 Jun 28 05:55 Meta-Llama-3-8B-Instruct.f16.gguf
-rw-r--r-- 1 root root 7042224416 Jun 28 06:07 Meta-Llama-3-8B-Instruct.q5_k.gguf
-rw-r--r-- 1 root root 7835472160 Jun 28 06:15 Meta-Llama-3-8B-Instruct.q6_k.gguf
-rw-r--r-- 1 root root 9525776672 Jun 28 06:17 Meta-Llama-3-8B-Instruct.q8_0.gguf
What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?
What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?
No need to be snippy, but if the size is not the same, it means the quantization process was different from the one I proposed. That's all.