Safetensor naming convention

#1
by dannysemi - opened

The names of the safetensors files in this repo do not match the naming convention that vLLM expects when loading a model from HF.

You can just rename the file to whatever vLLM expects to get it working. As far as I can tell, there are no config file references to the filename.

I'll see if there's a better way to indicate the GPTQ quant settings outside of the filename.

It looks like vLLM specifically tries to match the pattern *.safetensors.

https://github.com/vllm-project/vllm/blob/31348dff03d638eb66abda9bec94b8992de9c7a1/vllm/model_executor/weight_utils.py#L137
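For reference, the discovery step is essentially a glob over the downloaded snapshot folder, so anything that doesn't literally end in .safetensors is skipped. A rough sketch of that check (hf_folder here is a hypothetical local path, not vLLM's actual code):

import glob
import os

# Rough sketch: glob for "*.safetensors" in the local snapshot folder, so
# split files with any other suffix are never picked up.
hf_folder = "/path/to/hf/snapshot"  # hypothetical local path
weight_files = glob.glob(os.path.join(hf_folder, "*.safetensors"))
if not weight_files:
    raise RuntimeError(f"no *.safetensors files found in {hf_folder}")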

I can download the files and rename them. I just figured it was a mistake, because most split safetensors files I've seen put the part number in the filename prefix.

The issue is that the safetensors file exceeds Hugging Face's 50 GB per-file limit. I've updated the model card with instructions on how to join the files together. I'll update my quant scripts in the future to split the GPTQ model correctly so this manual step isn't required.
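For illustration, a rough sketch of the kind of arbitrary byte-level split that produces these part files (the file name, part naming, and chunk size are placeholders, not the exact script used for this repo):

# Split a large file into parts that stay under the 50 GB per-file limit.
# File name and part naming are placeholders, not this repo's exact layout.
CHUNK = 45 * 1024**3  # stay safely under 50 GB per part
BUF = 64 * 1024**2    # copy in 64 MB buffers to keep memory use low

def split_file(path):
    part = 0
    with open(path, "rb") as src:
        while True:
            buf = src.read(BUF)
            if not buf:
                break
            with open(f"{path}.part-{chr(ord('a') + part)}", "wb") as dst:
                written = 0
                while buf and written < CHUNK:
                    dst.write(buf)
                    written += len(buf)
                    buf = src.read(min(BUF, CHUNK - written))
            part += 1

split_file("gptq_model-4bit-32g.safetensors")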

Thank you. Is there any way you can rename the files in the repo to something like part-a-gptq_model-4bit-32g.safetensors and part-b-gptq_model-4bit-32g.safetensors?

I do not believe that will work. Normally, when a model is properly sharded, each file is referenced in one of the config files, which records which layers are present in which file. Here I've just split the model file arbitrarily, so fixing it would require proper sharding and a re-upload.
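For comparison, a minimal sketch of what proper sharding looks like with transformers (paths and shard size are placeholders): save_pretrained with max_shard_size writes several *.safetensors shards plus a model.safetensors.index.json that maps each weight to its shard, which is the reference loaders expect.

from transformers import AutoModelForCausalLM

# Minimal sketch of proper sharding (paths and shard size are placeholders).
# save_pretrained with max_shard_size writes multiple *.safetensors shards
# plus a model.safetensors.index.json mapping each weight to its shard.
model = AutoModelForCausalLM.from_pretrained("path/to/quantized-model")
model.save_pretrained("path/to/sharded-output", max_shard_size="40GB")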

Unfortunately, I run out of disk space when trying to run that script to convert the files. I'm using a pod with limited disk space.

I'm on mobile and can't copy the exact command, but if you have 25 GB free you can do something like:
cat file-b >> file-a
mv file-a file

You are just concatenating file-b onto the end of file-a and then renaming it.
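If cat isn't available, or you want to script it, the same join can be done from Python (the file names here are placeholders for the actual part files):

import shutil

# Append file-b to the end of file-a, then rename the result.
# File names are placeholders for the actual part files in the repo.
with open("file-a", "ab") as dst, open("file-b", "rb") as src:
    shutil.copyfileobj(src, dst)

shutil.move("file-a", "gptq_model-4bit-32g.safetensors")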

Thanks. That worked.

@LoneStriker I would suggest using the HF transformers integration of GPTQ to do the quantization instead of AutoGPTQ. Transformers does the sharding automatically.
You need the latest optimum, as it fixes a bug with passing a pre-tokenized dataset.
Example:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

def quantize(model_id, bits, group_size, dataset):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    dataset = prepare_dataset(dataset, tokenizer)

    # You can pass a pre-tokenized dataset, or a list of str together with the
    # tokenizer object. I personally use the exllamav2 calibration set.
    gptq_config = GPTQConfig(
        bits=bits,
        dataset=dataset,
        group_size=group_size,
        desc_act=True,
        use_cuda_fp16=True,
    )
    # The quantization itself happens inside from_pretrained when a GPTQConfig is passed.
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config)
    model.to("cpu")
    # Workaround for a bug: transformers would try to save the dataset to the config,
    # and a pre-tokenized dataset is a torch.Tensor, which cannot be saved to JSON.
    model.config.quantization_config.dataset = None
    model.save_pretrained(f"{model_id}_{bits}bit")
    tokenizer.save_pretrained(f"{model_id}_{bits}bit")
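The prepare_dataset helper above isn't shown in the thread; a minimal sketch of what it might look like, assuming the calibration data is a plain text file and the pre-tokenized samples are handed straight to GPTQConfig (the exact format of the exllamav2 calibration set is an assumption here):

def prepare_dataset(dataset_path, tokenizer, max_length=2048):
    # Hypothetical helper, not the exact function used above: read a plain-text
    # calibration file and pre-tokenize each non-empty line so GPTQConfig
    # receives tokenized samples instead of raw strings.
    with open(dataset_path, "r", encoding="utf-8") as f:
        texts = [line.strip() for line in f if line.strip()]
    return [
        tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
        for text in texts
    ]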

And I think that if you swap GPTQConfig for AwqConfig you can do AWQ, but I haven't tested that.

Thanks. I'll try this next time I run a large GPTQ quant.
