dolphin-2.9.4-llama3.1-8b?
Would it be possible to quantize cognitivecomputations/dolphin-2.9.4-llama3.1-8b? I wanted to do that, but I am getting an error, perhaps due to my setup. I created an issue for that.
Sure, I will do it now.
Thank you.
I can't seem to do any AWQ quants anymore.
I also get the same error with the llm-quantkit Python package.
pip install --upgrade llm-quantkit[cuda]
quantkit awq cognitivecomputations/dolphin-2.9.4-llama3.1-8b -out dolphin-2.9.4-llama3.1-8b-AWQ
Even the simplest example is not working.
Seems this model needs transformers>=4.44.0.dev0, while the AutoAWQ library wants 4.35 or something like that.
I will try downgrading the transformers version to see if it can work.
OK, I've changed the autoawq-kernels and am rebuilding the wheels for it; maybe I can get this working after all.
Basically, both of them (the awq and kernels repos) lock the pytorch version specifically to 2.3.1, and we need 2.4.0 to work with the new transformers.
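For what it's worth, a quick version check at the top of a quant script can surface this kind of mismatch early. A minimal sketch, with the bounds taken from the versions mentioned above:

import torch
import transformers
from packaging import version

# Report the environment before attempting a quant, since the model config and
# the rebuilt AutoAWQ kernels each have their own minimum versions.
print(f"transformers: {transformers.__version__}, torch: {torch.__version__}")

if version.parse(transformers.__version__) < version.parse("4.44.0.dev0"):
    print("transformers is older than 4.44.0.dev0; this model's config may not load")
if version.parse(torch.__version__) < version.parse("2.4.0"):
    print("torch is older than 2.4.0; the stock AutoAWQ kernels may pin an older build")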
Completed: https://huggingface.co/solidrust/dolphin-2.9.4-llama3.1-8b-AWQ
Thank you for your encouragement; this problem pissed me off and I had given up on it.
You are the best! Thank you.
These Minitron models that distilled Llama 3.1 8B into 4B look to be working. One fine-tune is: Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05
Getting a weird error with that one:
in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 3072]) in "weight" (which has shape torch.Size([768, 3072])), this looks incorrect.
I will need to debug it. Here is an example of how to reproduce the error (feeling too lazy to fix my repo, so using quantkit today):
transformers==4.42.3
llm-quantkit[cuda]
import json
import os

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoConfig
from huggingface_hub import snapshot_download, create_repo, upload_folder

model_path = 'Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05'
quant_path = 'Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
quanter = 'suparious'
quant_org = 'solidrust'

# Download the model to a local directory
local_model_path = snapshot_download(model_path)

# Load the model configuration file from the local directory
config_file = os.path.join(local_model_path, "config.json")
with open(config_file, "r") as f:
    config_dict = json.load(f)

# Modify the rope_scaling dictionary to include only the required fields
if 'rope_scaling' in config_dict:
    rope_scaling = config_dict['rope_scaling']
    if 'type' in rope_scaling and 'factor' in rope_scaling:
        # Ensure the type is one of the valid values
        if rope_scaling['type'] not in ['linear', 'dynamic']:
            rope_scaling['type'] = 'linear'  # Set to a default valid value
        # Ensure the factor is a float greater than 1
        if not isinstance(rope_scaling['factor'], float) or rope_scaling['factor'] <= 1.0:
            rope_scaling['factor'] = 2.0  # Set to a default valid value
        config_dict['rope_scaling'] = {'type': rope_scaling['type'], 'factor': rope_scaling['factor']}
    else:
        # If 'type' or 'factor' is missing, set default values
        config_dict['rope_scaling'] = {'type': 'linear', 'factor': 2.0}

# Save the modified configuration file
with open(config_file, "w") as f:
    json.dump(config_dict, f, indent=2)

# Load the model with the modified configuration
model = AutoAWQForCausalLM.from_pretrained(
    local_model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')

# Upload the quantized model to the Hugging Face Hub
create_repo(repo_id=f"{quant_org}/{quant_path}")
upload_folder(
    folder_path=quant_path,
    repo_id=f"{quant_org}/{quant_path}",
)
print(f'Quantized model uploaded to "{quant_org}/{quant_path}"')
I had locked the transformers version to work around a Llama 3.1 bug, but maybe the latest library is working / required for this model?
Nope, the latest transformers doesn't stop AWQ from choking on these rope scaling methods that people are using to extend LLM context windows.
python quantize.py
Fetching 15 files: 100%|████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 156503.88it/s]
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/ubuntu/smd/quantize.py", line 42, in <module>
model = AutoAWQForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/awq/models/auto.py", line 71, in from_pretrained
return AWQ_CAUSAL_LM_MODEL_MAP[model_type].from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/awq/models/base.py", line 380, in from_pretrained
model = target_cls.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3960, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4434, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/ubuntu/venv-quantkit/lib/python3.11/site-packages/accelerate/utils/modeling.py", line 373, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 3072]) in "weight" (which has shape torch.Size([768, 3072])), this looks incorrect.
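The "Unrecognized keys in `rope_scaling`" warnings above come from newer transformers expecting a `rope_type` key instead of the legacy `type`. A minimal sketch of renaming the key in the downloaded config; this silences the warning, but it does not explain the tensor-shape mismatch:

import json
import os

def rename_rope_key(local_model_path):
    # Newer transformers reads rope_scaling["rope_type"]; older configs ship "type".
    config_file = os.path.join(local_model_path, "config.json")
    with open(config_file) as f:
        config = json.load(f)
    rope = config.get("rope_scaling")
    if isinstance(rope, dict) and "type" in rope and "rope_type" not in rope:
        rope["rope_type"] = rope.pop("type")
        with open(config_file, "w") as f:
            json.dump(config, f, indent=2)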
I stopped doing AWQ quants as I would need to really learn the AutoAWQ library to keep up with all the weird stuff that the community does to the foundational models.
I'll try my best to debug this one.
OK, I managed to handle this in such a shitty and miserable way....
import transformers

def override_rope_embeddings():
    from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, rotate_half

    def custom_apply_rotary_pos_emb(q, k, cos, sin):
        # Truncate the tensors so their last dimensions match
        min_dim = min(q.shape[-1], cos.shape[-1], sin.shape[-1])
        q = q[..., :min_dim]
        k = k[..., :min_dim]
        cos = cos[..., :min_dim]
        sin = sin[..., :min_dim]
        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

    # Override the existing function
    transformers.models.llama.modeling_llama.apply_rotary_pos_emb = custom_apply_rotary_pos_emb
This is so stupid....
The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In transformers v4.45 `position_ids` will be removed and `position_embeddings` will be mandatory.
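For reference, a sketch of what a forward-compatible override could look like, matching the `(q, k, cos, sin, position_ids, unsqueeze_dim)` signature transformers uses during this transition; the unsqueeze handling mirrors what the upstream helper does and is an assumption here:

import transformers
from transformers.models.llama.modeling_llama import rotate_half

def custom_apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # Broadcast cos/sin across the head dimension, as the upstream helper does
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    # Same truncation hack as above, so mismatched head dims do not blow up
    min_dim = min(q.shape[-1], cos.shape[-1], sin.shape[-1])
    q, k = q[..., :min_dim], k[..., :min_dim]
    cos, sin = cos[..., :min_dim], sin[..., :min_dim]
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

transformers.models.llama.modeling_llama.apply_rotary_pos_emb = custom_apply_rotary_pos_emb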
It is quantizing now...
OK, @vaclavkosar - thank you for the Saturday morning algebra challenge.
solidrust/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ is ready; not sure if it will work properly or not, I did a lot of messing around this time.
Trying now!
I have to say that solidrust/Meta-Llama-3.1-8B-Instruct-abliterated-AWQ was the best model so far. Maybe because the Llama fine-tuning is exceptional and abliteration just adds the free-range talk back in.
It failed with:
/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
360 if value is not None:
361 if old_value.shape != value.shape:
--> 362 raise ValueError(
363 f'Trying to set a tensor of shape {value.shape} in "{tensor_name}" (which has shape {old_value.shape}), this look incorrect.'
364 )
ValueError: Trying to set a tensor of shape torch.Size([3072, 96]) in "qweight" (which has shape torch.Size([3072, 128])), this look incorrect.
I think a quant config needs to be added, like: https://huggingface.co/solidrust/Starling-LM-7B-beta-AWQ/blob/main/quant_config.json
Yeah, these models have this issue.
Here is the script that I used, but it is not great, as the quantized model seems shit afterwards...
import json
import os

import torch
import transformers
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoConfig
from huggingface_hub import snapshot_download, create_repo, upload_folder

model_path = 'Magpie-Align/Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05'
quant_path = 'Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
quant_org = 'solidrust'

def download_model(model_path):
    try:
        return snapshot_download(model_path)
    except Exception as e:
        print(f"Error downloading model: {e}")
        raise

def sanitize_rope_scaling(config_dict):
    if 'rope_scaling' in config_dict:
        rope_scaling = config_dict['rope_scaling']
        if isinstance(rope_scaling, dict):
            valid_keys = ['type', 'factor']
            rope_scaling = {k: v for k, v in rope_scaling.items() if k in valid_keys}
            if rope_scaling.get('type') not in ['linear', 'dynamic']:
                print("Invalid 'type' in rope_scaling. Setting to 'linear'.")
                rope_scaling['type'] = 'linear'
            if not isinstance(rope_scaling.get('factor'), float) or rope_scaling['factor'] <= 1.0:
                print("Invalid 'factor' in rope_scaling. Setting to 2.0.")
                rope_scaling['factor'] = 2.0
            # Write the sanitized copy back (the comprehension above made a new dict)
            config_dict['rope_scaling'] = rope_scaling
        else:
            print("Unexpected format for 'rope_scaling'. Removing it.")
            del config_dict['rope_scaling']
    return config_dict

def override_rope_embeddings():
    from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, rotate_half

    def custom_apply_rotary_pos_emb(q, k, cos, sin):
        # Truncate or pad the tensors to match dimensions
        min_dim = min(q.shape[-1], cos.shape[-1], sin.shape[-1])
        q = q[..., :min_dim]
        k = k[..., :min_dim]
        cos = cos[..., :min_dim]
        sin = sin[..., :min_dim]
        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)

    # Override the function within transformers
    transformers.models.llama.modeling_llama.apply_rotary_pos_emb = custom_apply_rotary_pos_emb

def load_model(local_model_path):
    try:
        config_file = os.path.join(local_model_path, "config.json")
        with open(config_file, "r") as f:
            config_dict = json.load(f)
        config_dict = sanitize_rope_scaling(config_dict)
        with open(config_file, "w") as f:
            json.dump(config_dict, f, indent=2)
        # Load model with to_empty to avoid copying from meta tensors
        model = AutoAWQForCausalLM.from_pretrained(
            local_model_path, **{"low_cpu_mem_usage": True, "use_cache": False}, ignore_mismatched_sizes=True
        )
        model.to_empty(device=torch.device("cuda"))
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        raise

def quantize_model(model, tokenizer):
    try:
        model.quantize(tokenizer, quant_config=quant_config)
    except ValueError as ve:
        print(f"Quantization Error: {ve}")
        raise

def upload_to_hf(quant_path, quant_org):
    try:
        create_repo(repo_id=f"{quant_org}/{quant_path}", exist_ok=True)
        upload_folder(
            folder_path=quant_path,
            repo_id=f"{quant_org}/{quant_path}",
        )
        print(f'Quantized model uploaded to "{quant_org}/{quant_path}"')
    except Exception as e:
        print(f"Error uploading to Hugging Face: {e}")
        raise

def main():
    local_model_path = download_model(model_path)
    # Override RoPE embedding calculations
    override_rope_embeddings()
    model = load_model(local_model_path)
    tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)
    quantize_model(model, tokenizer)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
    print(f'Model is quantized and saved at "{quant_path}"')
    upload_to_hf(quant_path, quant_org)

if __name__ == "__main__":
    main()
I was told by Casper Hansen that the quant_config.json is no longer supported, as he is adding this JSON block into the model's native config.json. I don't like this approach and prefer to use quant_config.json, in order to avoid molesting the native model's config JSON. Also, some tools and apps still look for this file. So my quant process usually adds this file, but today we used the quantkit project, which seems to have been abandoned, but it simplifies what I wanted to do with my srt-model-quantizing repo. I might pick up that project and deprecate my version.
So that is the reason why my AWQ quants typically always have the quant_config.json, despite @casperhansen's advice.
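For example, a minimal sketch of adding that file next to the saved quant, mirroring the quant_config dict used in the scripts above:

import json
import os

# Write the legacy quant_config.json alongside the quantized model so tools
# that still expect this file can pick up the settings.
quant_path = "Llama-3.1-Minitron-4B-Magpie-SFT-800K-MT-Magpo-3.1-Pro-05-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

with open(os.path.join(quant_path, "quant_config.json"), "w") as f:
    json.dump(quant_config, f, indent=2)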
Also super pissed off about AutoAWQ requiring torch==2.3.1, which is such a shitty torch version, with most of the issues fixed in 2.4.
I always have to build my own AutoAWQ, and the kernels, to support the latest transformers and torch. Such a useless waste of my time.
I made a fork of AutoAWQ that just always enjoys the latest versions of everything, but sometimes it is unstable, so that is the current state of this issue.
I see. Well, these things probably will get resolved later by the package authors.
This particular fine-tune is probably not that good. But in general, these Llama Minitron model fine-tunes will probably be used, since they are efficient distillations.
Hey, can you share your autoawq repository?
It is currently https://github.com/SolidRusT/srt-model-quantizing but I haven't worked on it in a while.
this Pypi package may be easier to use: https://pypi.org/project/llm-quantkit/
Thanks.
In general, I must tell you, pip dependencies are horrible. I just wanted to run an old, unrelated notebook, and it wouldn't start even when I pinned the original versions of the main packages. I would have to pin down every single one, for example with a poetry lock file.
I will set up a Docker image with all the Python stuff sorted out, to help users with this exact issue. However, I was able to get the https://github.com/SolidRusT/srt-model-quantizing/awq repo to a stable state. I literally worked on this all day today.
I still can't quant on my 12GB GPU, even though I have done over 500 AWQ quants that way before. The reason I stopped doing them is that the memory management in AutoAWQ is nonexistent / incomplete, and I haven't figured out how to solve it. I even connected with Casper Hansen on it.
But I can rent an NVIDIA A10G machine from Amazon, and the quant works fine on that 24GB GPU.
I also got the new Llama 3.1 models to quant there, using my awq repo.
That sounds painful... But is it, like, solvable on the Python side? With the new 3.12? That memory thing.
Unfortunately, this seems to be a compounded issue with AutoAWQ, Llama 3.1 rope_scaling hackery, and then GPU VRAM.
There is no way to use python 3.12 here. I tried for weeks, and there is just too much work to figure out by myself, and I had to take a break.
The problem with people using these rope scaling methods to work around the shitty context limitation of Llama 3.1 (8192 tokens) is that, to quantize for AWQ, you need to ensure your tensors are all on a single device (a single CPU or a single GPU), and there is no conceivable way to distribute them. I really detest this methodology of increasing the context windows of models, for this reason.
So now I have my repo exclusively using a single device for tensors, which handles the shitty Llama 3.1 rope scaling, but it also prevents me from using multi-GPU, partial CPU offload and other memory management techniques. And I can't even quant an 8B model on my 12GB GPUs, of which I have over a dozen, and I had intended to automate AWQ quants with that hardware investment.
But now that it is all memory fucked, I have to just release my pipeline and help people make their own AWQ quants.
I can't do more than 24GB in AWS. I can do multi-GPU, but this is now disabled in AutoAWQ.
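To illustrate that single-device constraint, a minimal sketch of pinning everything onto one GPU when loading a model for AWQ quantization; it assumes the device_map kwarg is passed through to transformers and that the single GPU has enough VRAM (e.g. a 24GB A10G):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "cognitivecomputations/dolphin-2.9.4-llama3.1-8b"

# Keep every tensor on GPU 0: no multi-GPU sharding, no partial CPU offload.
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map={"": 0},
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)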
I am sincerely considering making a fork of it, and using Claude Sonnet 3.5 to fix AutoAWQ automagically for us.
Maybe tomorrow, I am exhausted from today's refactoring.
I got 98% code coverage and all unit tests passing now in my repo.
Great work!
But... let's say we delete rope scaling from config.json, quantize it, and put it back after quantization, will that be a problem? But idk, their rope scaling type is different than usual, isn't it?
Trying this now, seeing this message, but it might still work... we'll see:
The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In v4.45 `position_ids` will be removed and `position_embeddings` will be mandatory.
Genius idea, seems to work.
solidrust/Hermes-3-Llama-3.1-8B-lorablated-AWQ
let me play more with my script and get it stable.
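A minimal sketch of that delete-then-restore approach; the model name and paths here are illustrative placeholders, not the exact script:

import json
import os

from awq import AutoAWQForCausalLM
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

model_path = 'NousResearch/Hermes-3-Llama-3.1-8B'  # illustrative source model
quant_path = 'Hermes-3-Llama-3.1-8B-AWQ'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

local_model_path = snapshot_download(model_path)
config_file = os.path.join(local_model_path, "config.json")

# Stash rope_scaling and remove it from the config before loading
with open(config_file) as f:
    config_dict = json.load(f)
rope_scaling = config_dict.pop("rope_scaling", None)
with open(config_file, "w") as f:
    json.dump(config_dict, f, indent=2)

model = AutoAWQForCausalLM.from_pretrained(
    local_model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Put the original rope_scaling back into the quantized model's config.json
if rope_scaling is not None:
    quant_config_file = os.path.join(quant_path, "config.json")
    with open(quant_config_file) as f:
        quant_cfg = json.load(f)
    quant_cfg["rope_scaling"] = rope_scaling
    with open(quant_config_file, "w") as f:
        json.dump(quant_cfg, f, indent=2)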
That's unexpected