Safetensor naming convention

#1
by dannysemi - opened

The names of the safetensors files in this repo do not match the naming convention that vLLM expects when loading a model from HF.

You can just rename the file to whatever vLLM expects to get it working. As far as I can tell, there are no config file references to the filename.

I'll see if there's a better way to indicate the GPTQ quant settings outside of the filename.

It looks like vLLM specifically tries to match the pattern *.safetensors.

https://github.com/vllm-project/vllm/blob/31348dff03d638eb66abda9bec94b8992de9c7a1/vllm/model_executor/weight_utils.py#L137
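For reference, the discovery step is essentially a glob over the downloaded snapshot folder, so anything that doesn't literally end in .safetensors is skipped. A rough sketch of that check (hf_folder here is a hypothetical local path, not vLLM's actual code):

import glob
import os

# Rough sketch: glob for "*.safetensors" in the local snapshot folder, so
# split files with any other suffix are never picked up.
hf_folder = "/path/to/hf/snapshot"  # hypothetical local path
weight_files = glob.glob(os.path.join(hf_folder, "*.safetensors"))
if not weight_files:
    raise RuntimeError(f"no *.safetensors files found in {hf_folder}")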

I can download the files and rename them. I just figured it was a mistake, because most split safetensors files I've seen put the part number in the filename prefix.

The issue is that the safetensors file exceeds Hugging Face's 50 GB per-file limit. I've updated the model card with instructions on how to join the files together. I'll update my quant scripts in the future to split the GPTQ model correctly so this manual step isn't required.
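For illustration, a rough sketch of the kind of arbitrary byte-level split that produces these part files (the file name, part naming, and chunk size are placeholders, not the exact script used for this repo):

# Split a large file into parts that stay under the 50 GB per-file limit.
# File name and part naming are placeholders, not this repo's exact layout.
CHUNK = 45 * 1024**3  # stay safely under 50 GB per part
BUF = 64 * 1024**2    # copy in 64 MB buffers to keep memory use low

def split_file(path):
    part = 0
    with open(path, "rb") as src:
        while True:
            buf = src.read(BUF)
            if not buf:
                break
            with open(f"{path}.part-{chr(ord('a') + part)}", "wb") as dst:
                written = 0
                while buf and written < CHUNK:
                    dst.write(buf)
                    written += len(buf)
                    buf = src.read(min(BUF, CHUNK - written))
            part += 1

split_file("gptq_model-4bit-32g.safetensors")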

Thank you. Is there any way you can rename the files in the repo to something like part-a-gptq_model-4bit-32g.safetensors and part-b-gptq_model-4bit-32g.safetensors?

I do not believe that will work. Normally, when a model is properly sharded, each file is referenced in one of the config files, which records which layers are present in which file. Here I've just split the model file arbitrarily, so fixing it would require proper sharding and a re-upload.
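For comparison, a minimal sketch of what proper sharding looks like with transformers (paths and shard size are placeholders): save_pretrained with max_shard_size writes several *.safetensors shards plus a model.safetensors.index.json that maps each weight to its shard, which is the reference loaders expect.

from transformers import AutoModelForCausalLM

# Minimal sketch of proper sharding (paths and shard size are placeholders).
# save_pretrained with max_shard_size writes multiple *.safetensors shards
# plus a model.safetensors.index.json mapping each weight to its shard.
model = AutoModelForCausalLM.from_pretrained("path/to/quantized-model")
model.save_pretrained("path/to/sharded-output", max_shard_size="40GB")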

Unfortunately, I run out of disk space when trying to run that script to convert the files. I'm using a pod with limited disk space.

I'm on mobile and can't copy the exact command, but if you have 25 GB free you can do something like:
cat file-b >> file-a
mv file-a file

You are just concatenating file-b onto the end of file-a and then renaming it.
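If cat isn't available, or you want to script it, the same join can be done from Python (the file names here are placeholders for the actual part files):

import shutil

# Append file-b to the end of file-a, then rename the result.
# File names are placeholders for the actual part files in the repo.
with open("file-a", "ab") as dst, open("file-b", "rb") as src:
    shutil.copyfileobj(src, dst)

shutil.move("file-a", "gptq_model-4bit-32g.safetensors")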

Thanks. That worked.

@LoneStriker I would suggest using the HF transformers integration of GPTQ to do the quantization instead of AutoGPTQ. Transformers does the sharding automatically.
You need the latest optimum, as it fixes a bug with passing a pre-tokenized dataset.
Example:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

def quantize(model_id, bits, group_size, dataset):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    dataset = prepare_dataset(dataset, tokenizer)

    # You can pass a pre-tokenized dataset, or a list of str together with the
    # tokenizer object. I personally use the exllamav2 calibration set.
    gptq_config = GPTQConfig(
        bits=bits,
        dataset=dataset,
        group_size=group_size,
        desc_act=True,
        use_cuda_fp16=True,
    )
    # The quantization itself happens inside from_pretrained when a GPTQConfig is passed.
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config)
    model.to("cpu")
    # Workaround for a bug: transformers would try to save the dataset to the config,
    # and a pre-tokenized dataset is a torch.Tensor, which cannot be saved to JSON.
    model.config.quantization_config.dataset = None
    model.save_pretrained(f"{model_id}_{bits}bit")
    tokenizer.save_pretrained(f"{model_id}_{bits}bit")
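The prepare_dataset helper above isn't shown in the thread; a minimal sketch of what it might look like, assuming the calibration data is a plain text file and the pre-tokenized samples are handed straight to GPTQConfig (the exact format of the exllamav2 calibration set is an assumption here):

def prepare_dataset(dataset_path, tokenizer, max_length=2048):
    # Hypothetical helper, not the exact function used above: read a plain-text
    # calibration file and pre-tokenize each non-empty line so GPTQConfig
    # receives tokenized samples instead of raw strings.
    with open(dataset_path, "r", encoding="utf-8") as f:
        texts = [line.strip() for line in f if line.strip()]
    return [
        tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
        for text in texts
    ]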

And I think that if you swap GPTQConfig for AwqConfig you can do AWQ, but I haven't tested that.

Thanks. I'll try this next time I run a large GPTQ quant.
