Did anyone get it to run? My setup:
CUDA 11.7, RTX 3090 24 GB
torch==2.1.1+cu118
transformers==4.36.0
auto-gptq==0.6.0.dev0+cu118 (built from source: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral)
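For reference, this is how I confirm the versions (a trivial sketch; I'm assuming auto_gptq exposes __version__):

# Environment sanity check (sketch; auto_gptq.__version__ is assumed to exist)
import torch
import transformers
import auto_gptq

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
print("transformers:", transformers.__version__)
print("auto-gptq:", auto_gptq.__version__)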
Try to load:
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    model_basename="model",
    revision="gptq-3bit-128g-actorder_True",
    strict=False,  # tried with and without this parameter, same result
    use_triton=False,
    use_safetensors=True,
    trust_remote_code=False,
    device="cuda:0",
    disable_exllama=True,
    disable_exllamav2=True,
    quantize_config=None,
)
I get this error:
File "/root/venv/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 276, in set_module_tensor_to_device
raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: QuantLinear() does not have a parameter or a buffer named weight.
Tried the same but with CUDA 12.1, torch==2.1.1+cu121 and auto-gptq==0.6.0.dev0+cu121 built from source. The same error.
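In case it helps, here is how I was planning to check what tensor names the checkpoint actually contains (a rough sketch; I'm assuming the 3-bit revision ships a single model.safetensors file, adjust the filename otherwise):

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download only the weights file of the revision in question
# (filename is an assumption; change it if the repo shards the checkpoint).
path = hf_hub_download(
    "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    "model.safetensors",
    revision="gptq-3bit-128g-actorder_True",
)

# GPTQ checkpoints normally expose qweight/qzeros/scales/g_idx tensors;
# a plain "weight" on a quantized linear layer would match the ValueError above.
with safe_open(path, framework="pt") as f:
    for name in sorted(f.keys())[:20]:
        print(name)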
Unfortunately there was an issue with the branch I linked; I didn't realise that the author had made another commit to it which broke inference again. I've now updated the README to reference a different branch.
The newly linked PR will now work: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral-fix
Built AutoGPTQ OK with CUDA 12.1, transformers 4.36.0 and torch==2.1.1+cu121, giving auto-gptq==0.6.0.dev0+cu121.
But model loading failed in text-generation-webui:
Traceback (most recent call last):
File "/home/me/text-generation-webui/modules/ui_model_menu.py", line 208, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/text-generation-webui/modules/models.py", line 89, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/text-generation-webui/modules/models.py", line 380, in AutoGPTQ_loader
return modules.AutoGPTQ_loader.load_quantized(model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/text-generation-webui/modules/AutoGPTQ_loader.py", line 58, in load_quantized
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/miniconda3/envs/textgen/lib/python3.11/site-packages/auto_gptq/modeling/auto.py", line 102, in from_quantized
model_type = check_and_get_model_type(model_name_or_path, trust_remote_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/miniconda3/envs/textgen/lib/python3.11/site-packages/auto_gptq/modeling/_utils.py", line 232, in check_and_get_model_type
raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: mixtral isn't supported yet.
I probably missed something to end up with that "mixtral isn't supported yet" error. But what?
@tsalvoch Most likely you did not build auto-gptq from the Mixtral-fix git branch. I had the same error when I built it from the master branch.

https://github.com/LaaZa/AutoGPTQ/tree/Mixtral-fix

git checkout Mixtral-fix
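A quick way to confirm the right branch actually got installed (just a sketch; in the auto-gptq source I have, the list lives in auto_gptq.modeling._const, so treat the import path as an assumption):

# If the Mixtral-fix branch was built and installed correctly, "mixtral" should
# be a supported model type; otherwise check_and_get_model_type raises
# "mixtral isn't supported yet."
from auto_gptq.modeling._const import SUPPORTED_MODELS

print("mixtral" in SUPPORTED_MODELS)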
@TheBloke Thank you!
@dimaischenko How did you get this to run on a 3090? With Mixtral-fix it does try to load, but runs out of memory on my 4090.
I do have 2x 4090s; I guess I'll look through the code base to see if/how to specify multiple GPUs.
@bdambrosio It works fine for me on a 3090, even with revision="main". But you can try revision="gptq-3bit-128g-actorder_True"; it takes about 19 GB (see the example in my first post in this thread).
Ah, yup, just realized my error. I had loaded a larger version assuming I would use both GPUs. Downloading the smaller version now, while also trying to figure out the syntax of the AutoGPTQ .from_quantized device parameter.
tnx!
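For anyone else who lands here, this is roughly the call shape I ended up experimenting with (just a sketch; I'm assuming from_quantized accepts either a device string or an accelerate-style max_memory dict for splitting across GPUs, so check the signature of your installed auto-gptq):

from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

# Single GPU: pin everything to cuda:0.
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    revision="gptq-3bit-128g-actorder_True",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
)

# Two GPUs: cap the memory per device and let the layers be split
# (max_memory is assumed to be forwarded to accelerate's dispatch logic).
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    revision="gptq-4bit-128g-actorder_True",
    model_basename="model",
    use_safetensors=True,
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "32GiB"},
)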
Ah, in case anyone else stumbles here: @TheBloke, any ideas? This is with gptq-4bit-128g-actorder_True:
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# model_name_or_path points at my download of the gptq-4bit-128g-actorder_True files, e.g.:
model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename="model",
    use_safetensors=True,
    per_gpu_max_memory={0: "20GIB", 1: "20GIB"},
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template = f'''[INST] {prompt} [/INST]'''

print("\n\n*** Generate:")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.1, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))
(mistral) bruce@bruce-AI:~/Downloads/alphawave/tests/Sam$ python mixtral-8x-GPTQ.py
MixtralGPTQForCausalLM hasn't fused attention module yet, will skip inject fused attention.
MixtralGPTQForCausalLM hasn't fused mlp module yet, will skip inject fused mlp.
*** Generate:
Traceback (most recent call last):
File "/home/bruce/Downloads/alphawave/tests/Sam/mixtral-8x-GPTQ.py", line 31, in
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
File "/home/bruce/miniconda3/envs/mistral/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 447, in generate
return self.model.generate(**kwargs)
File "/home/bruce/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/bruce/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
return self.sample(
File "/home/bruce/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2897, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
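Not an answer, but a way to narrow this down: the failure happens inside torch.multinomial during sampling, so checking the raw logits and trying greedy decoding tells you whether the model output itself is already broken. A rough sketch (my own debugging idea, reusing the model/tokenizer/input_ids names from the script above and assuming the AutoGPTQ wrapper forwards calls to the underlying model):

import torch

# 1) Inspect the raw logits: NaN/Inf here means the forward pass itself is
#    producing garbage (wrong kernel, bad device split, broken quant weights),
#    not just an unlucky sampling step.
with torch.no_grad():
    logits = model(input_ids).logits
print("any NaN:", torch.isnan(logits).any().item(),
      "any Inf:", torch.isinf(logits).any().item())

# 2) Greedy decoding skips torch.multinomial entirely; if this works while
#    sampling fails, the problem is only in the probability distribution.
output = model.generate(inputs=input_ids, do_sample=False, max_new_tokens=64)
print(tokenizer.decode(output[0]))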