
Success using AutoGPTQ?

#19
by joshlevy89 - opened

I'm trying to load the pretrained model using AutoGPTQ but I'm getting an error.
I'm on Colab (T4).

Update: I am still getting the error below for this model, and for the non-RLHF'ed safetensors version of Vicuna (vicuna-13B-1.1-GPTQ-4bit-128g.latest.safetensors).
However, I did get the no-act-order .pt version of the latter to work (vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt).
So I guess it's something to do with compatibility with certain types of quantization but not others... maybe some configuration arguments need to be passed to make it work?

Code:
!pip install auto-gptq[llama]
from auto_gptq import AutoGPTQForCausalLM
device = "cuda:0"
MODEL_DIR = "/content/drive/MyDrive/AI/projects/lie_detector/TheBloke/stable-vicuna-13B-GPTQ"
MODEL_FILE = "stable-vicuna-13B-GPTQ-4bit.compat.no-act-order.safetensors"
model = AutoGPTQForCausalLM.from_quantized(MODEL_DIR, model_basename=MODEL_FILE, use_triton=True)

Note: I had to append .bin to the model file name because it seems that AutoGPTQ does this automatically.
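
For context, a rough sketch (an assumption for illustration, not AutoGPTQ's actual source) of how from_quantized resolves the checkpoint file from model_basename - the library appends the extension itself, which is why the basename should not carry one and why the file on disk had to be renamed:

import os

# Sketch only: from_quantized appends the extension to model_basename itself.
# With use_safetensors left at its default (False) it looks for "<basename>.bin",
# which is why the safetensors file had to be renamed to end in .bin here.
def resolve_checkpoint_path(save_dir: str, model_basename: str, use_safetensors: bool = False) -> str:
    ext = ".safetensors" if use_safetensors else ".bin"
    return os.path.join(save_dir, model_basename + ext)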

Error:
UnpicklingError: invalid load key, '\xa8'.

Full Trace:
WARNING:auto_gptq.modeling._base:use_triton will force moving the whole model to GPU, make sure you have enough VRAM.
Traceback (most recent call last)
in <cell line: 8>:8

/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py:63 in from_quantized

     60         trust_remote_code: bool = False
     61     ) -> BaseGPTQForCausalLM:
     62         model_type = check_and_get_model_type(save_dir)
  ❱  63         return GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(
     64             save_dir=save_dir,
     65             device=device,
     66             use_safetensors=use_safetensors,

/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py:544 in from_quantized

    541         if not max_memory and not device_map:
    542             device_map = {"": device}
    543
  ❱ 544         model = accelerate.load_checkpoint_and_dispatch(
    545             model, model_save_name, device_map, max_memory, no_split_module_classes=[cls
    546         )
    547

/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py:479 in load_checkpoint_and_dispatch

    476         )
    477     if offload_state_dict is None and device_map is not None and "disk" in device_map.va
    478         offload_state_dict = True
  ❱ 479     load_checkpoint_in_model(
    480         model,
    481         checkpoint,
    482         device_map=device_map,

/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py:971 in load_checkpoint_in_model

    968     buffer_names = [name for name, _ in model.named_buffers()]
    969
    970     for checkpoint_file in checkpoint_files:
  ❱ 971         checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
    972         if device_map is None:
    973             model.load_state_dict(checkpoint, strict=False)
    974         else:

/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py:873 in load_state_dict

    870
    871             return tensors
    872     else:
  ❱ 873         return torch.load(checkpoint_file)
    874
    875
    876 def load_checkpoint_in_model(

/usr/local/lib/python3.10/dist-packages/torch/serialization.py:815 in load

    812                 return _legacy_load(opened_file, map_location, _weights_only_unpickler,
    813             except RuntimeError as e:
    814                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
  ❱ 815         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args
    816
    817
    818 # Register pickling support for layout instances such as

/usr/local/lib/python3.10/dist-packages/torch/serialization.py:1033 in _legacy_load

   1030             f"Received object of type \"{type(f)}\". Please update to Python 3.8.2 or ne
   1031             "functionality.")
   1032
  ❱ 1033     magic_number = pickle_module.load(f, **pickle_load_args)
   1034     if magic_number != MAGIC_NUMBER:
   1035         raise RuntimeError("Invalid magic number; corrupt file?")
   1036     protocol_version = pickle_module.load(f, **pickle_load_args)
UnpicklingError: invalid load key, '\xa8'.

OK, firstly please switch to the faster-llama branch of AutoGPTQ, i.e. do the following:

git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout faster-llama
pip install .

Then test this model with this code:

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import argparse

quantized_model_dir = "/workspace/models/TheBloke_stable-vicuna-13B-GPTQ"

model_basename = "stable-vicuna-13B-GPTQ-4bit.compat.no-act-order"

use_strict = False

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False
    )

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
        use_safetensors=True,
        strict=use_strict,
        model_basename=model_basename,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=quantize_config)

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

Example output:

root@d2264a66ab0b:/workspace# python simple_gptq_anon.py
*** Pipeline:
### Human: Tell me about AI
### Assistant: Artificial Intelligence (AI) is a field of computer science that focuses on creating intelligent machines that can think and act like humans. It involves the development of algorithms, models, and systems that enable computers to learn from data, make decisions, and perform tasks without human intervention. Some examples of AI include natural language processing, machine learning, robotics, and computer vision. The potential applications of AI are vast and range from healthcare and finance to transportation and entertainment. However, there are also concerns around the impact of AI on employment and privacy.


*** Generate:
<s> ### Human: Tell me about AI
### Assistant: AI stands for Artificial Intelligence. It is a field of computer science that focuses on creating intelligent machines that can think and act like humans. AI involves the development of algorithms and models that enable computers to learn from data, make predictions, and take actions based on that data. AI is used in a variety of applications, including natural language processing, computer vision, robotics, and more.

Some of the key areas of AI research include:

- Machine learning: This involves the development of algorithms that enable computers to learn from data and make predictions or take actions based on that data.

- Natural language processing: This involves the development of algorithms that enable computers to understand and generate human language.

- Computer vision: This involves the development of algorithms that enable computers to interpret and analyze visual data, such as images and videos.

- Robotics: This involves the development of algorithms and models that enable robots to perform tasks autonomously.

- AI ethics: This involves the study of the ethical implications of AI, such as the potential impact of AI on society and the need for ethical guidelines in AI development.

Overall, AI has the potential to revolutionize many industries and transform the way we live and work. However, there are also concerns about the potential negative impacts of AI, such as job displacement and the potential for AI to be used for malicious purposes.</s>

The same should work for the old Vicuna model - but yes, you have to rename it to .bin (or symlink it). I will eventually do that in the repo itself - or probably just replace the model with a .safetensors file instead.
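
For reference, a minimal sketch of that rename/symlink step in Python (file names are taken from the .pt file mentioned above; adjust the paths to wherever the file actually lives):

import os

# The .pt file from the repo, and the .bin name AutoGPTQ expects to find
src = "vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt"
dst = "vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.bin"

# Symlink so the original file stays untouched; use os.rename(src, dst) to rename outright instead
if not os.path.exists(dst):
    os.symlink(src, dst)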

TheBloke changed discussion status to closed
TheBloke changed discussion status to open

(sorry didn't mean to close the thread there)

Thank you. I've been encountering the same error.

@TheBloke Major thanks!! It worked

A few notes for future readers: (1) faster-llama has been merged into mainline; (2) if using use_safetensors, you no longer need to append .bin to the actual file name... you do need to remove ".safetensors" from the model_basename because the code adds it automatically; (3) to get vanilla Vicuna (i.e. not stable-vicuna) to work, I had to change desc_act to True in BaseQuantizeConfig... this is a variable to be mindful of (see the sketch below).
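
Putting those notes together, a hedged sketch of what loading the vanilla Vicuna safetensors file looks like - the local directory path is a placeholder, the basename is the file named earlier with ".safetensors" stripped, and desc_act is set to True per point (3):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Placeholder path to a local copy of the vanilla Vicuna GPTQ repo
quantized_model_dir = "/workspace/models/vicuna-13B-1.1-GPTQ-4bit-128g"

# Basename only - no ".safetensors", since from_quantized appends it (point 2)
model_basename = "vicuna-13B-1.1-GPTQ-4bit-128g.latest"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)

# desc_act=True for this particular file (point 3)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
        use_safetensors=True,
        model_basename=model_basename,
        device="cuda:0",
        use_triton=False,
        quantize_config=quantize_config)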

Yeah re 3 that's my fault for having two model files in one repo, with only one quantize_config.json. On all my recent repos I use separate branches for separate model files.
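
As an aside, when a repo does use separate branches for separate model files, one way to pull a specific branch is huggingface_hub's snapshot_download with the revision argument (the repo id and branch name below are placeholders, not this repo's actual layout):

from huggingface_hub import snapshot_download

# Placeholder repo id and branch name - substitute the repo and branch you actually want
local_dir = snapshot_download(repo_id="TheBloke/some-model-GPTQ", revision="main")
print(local_dir)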

I keep meaning to clean that up and will do soon.

Glad you got it working!

joshlevy89 changed discussion status to closed
