Did anyone get it to run? My setup:
CUDA 11.7, RTX 3090 24 GB
torch==2.1.1+cu118
transformers==4.36.0
auto-gptq==0.6.0.dev0+cu118 (built from source: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral)
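For reference, this is how I confirm the versions (a trivial sketch; I'm assuming auto_gptq exposes __version__):

# Environment sanity check (sketch; auto_gptq.__version__ is assumed to exist)
import torch
import transformers
import auto_gptq

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
print("transformers:", transformers.__version__)
print("auto-gptq:", auto_gptq.__version__)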
Try to load:
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    model_basename="model",
    revision="gptq-3bit-128g-actorder_True",
    strict=False,  # tried with and without this parameter, same result
    use_triton=False,
    use_safetensors=True,
    trust_remote_code=False,
    device="cuda:0",
    disable_exllama=True,
    disable_exllamav2=True,
    quantize_config=None,
)
I get this error:
File "/root/venv/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 276, in set_module_tensor_to_device
raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: QuantLinear() does not have a parameter or a buffer named weight.
Tried the same but with CUDA 12.1, torch==2.1.1+cu121 and auto-gptq==0.6.0.dev0+cu121 built from source. The same error.
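In case it helps, here is how I was planning to check what tensor names the checkpoint actually contains (a rough sketch; I'm assuming the 3-bit revision ships a single model.safetensors file, adjust the filename otherwise):

from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download only the weights file of the revision in question
# (filename is an assumption; change it if the repo shards the checkpoint).
path = hf_hub_download(
    "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    "model.safetensors",
    revision="gptq-3bit-128g-actorder_True",
)

# GPTQ checkpoints normally expose qweight/qzeros/scales/g_idx tensors;
# a plain "weight" on a quantized linear layer would match the ValueError above.
with safe_open(path, framework="pt") as f:
    for name in sorted(f.keys())[:20]:
        print(name)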
Unfortunately there was an issue with the branch I linked; I didn't realise that the author had made another commit to it which broke inference again. I've now updated the README to reference a different branch.
The newly linked PR will now work: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral-fix
Built AutoGPTQ OK with CUDA 12.1, transformers 4.36.0 and torch==2.1.1+cu121, giving auto-gptq==0.6.0.dev0+cu121.
But model loading failed in text-generation-webui:
Traceback (most recent call last):
File "/home/me/text-generation-webui/modules/ui_model_menu.py", line 208, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/text-generation-webui/modules/models.py", line 89, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/text-generation-webui/modules/models.py", line 380, in AutoGPTQ_loader
return modules.AutoGPTQ_loader.load_quantized(model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/text-generation-webui/modules/AutoGPTQ_loader.py", line 58, in load_quantized
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/miniconda3/envs/textgen/lib/python3.11/site-packages/auto_gptq/modeling/auto.py", line 102, in from_quantized
model_type = check_and_get_model_type(model_name_or_path, trust_remote_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/me/miniconda3/envs/textgen/lib/python3.11/site-packages/auto_gptq/modeling/_utils.py", line 232, in check_and_get_model_type
raise TypeError(f"{config.model_type} isn't supported yet.")
TypeError: mixtral isn't supported yet.
I probably missed something to end up with that "mixtral isn't supported yet" error. But what?
@tsalvoch Most likely you did not build auto-gptq from the Mixtral-fix git branch. I had the same error when I built it from the master branch.

https://github.com/LaaZa/AutoGPTQ/tree/Mixtral-fix

git checkout Mixtral-fix
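A quick way to confirm the right branch actually got installed (just a sketch; in the auto-gptq source I have, the list lives in auto_gptq.modeling._const, so treat the import path as an assumption):

# If the Mixtral-fix branch was built and installed correctly, "mixtral" should
# be a supported model type; otherwise check_and_get_model_type raises
# "mixtral isn't supported yet."
from auto_gptq.modeling._const import SUPPORTED_MODELS

print("mixtral" in SUPPORTED_MODELS)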
@TheBloke Thank you!
@dimaischenko How did you get this to run on a 3090? With Mixtral-fix it does try to load, but runs out of memory on my 4090.
I do have 2x 4090s; I guess I'll look through the code base to see if/how to specify multiple GPUs.
@bdambrosio It works fine for me on a 3090, even with revision="main". But you can try revision="gptq-3bit-128g-actorder_True"; it takes about 19 GB (see the example in my first post in this thread).
Ah, yup, just realized my error. I had loaded a larger version assuming I would use both GPUs. Downloading the smaller version now, while also trying to figure out the syntax of the AutoGPTQ .from_quantized device parameter.
tnx!
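For anyone else who lands here, this is roughly the call shape I ended up experimenting with (just a sketch; I'm assuming from_quantized accepts either a device string or an accelerate-style max_memory dict for splitting across GPUs, so check the signature of your installed auto-gptq):

from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

# Single GPU: pin everything to cuda:0.
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    revision="gptq-3bit-128g-actorder_True",
    model_basename="model",
    use_safetensors=True,
    device="cuda:0",
)

# Two GPUs: cap the memory per device and let the layers be split
# (max_memory is assumed to be forwarded to accelerate's dispatch logic).
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    revision="gptq-4bit-128g-actorder_True",
    model_basename="model",
    use_safetensors=True,
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "32GiB"},
)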
Ah, in case anyone else stumbles here: @TheBloke, any ideas? This is with gptq-4bit-128g-actorder_True:
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# model_name_or_path points at my download of the gptq-4bit-128g-actorder_True files, e.g.:
model_name_or_path = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename="model",
    use_safetensors=True,
    per_gpu_max_memory={0: "20GIB", 1: "20GIB"},
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template = f'''[INST] {prompt} [/INST]'''

print("\n\n*** Generate:")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.1, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))
(mistral) bruce@bruce-AI:~/Downloads/alphawave/tests/Sam$ python mixtral-8x-GPTQ.py
MixtralGPTQForCausalLM hasn't fused attention module yet, will skip inject fused attention.
MixtralGPTQForCausalLM hasn't fused mlp module yet, will skip inject fused mlp.
*** Generate:
Traceback (most recent call last):
File "/home/bruce/Downloads/alphawave/tests/Sam/mixtral-8x-GPTQ.py", line 31, in
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
File "/home/bruce/miniconda3/envs/mistral/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 447, in generate
return self.model.generate(**kwargs)
File "/home/bruce/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/bruce/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
return self.sample(
File "/home/bruce/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2897, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
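Not an answer, but a way to narrow this down: the failure happens inside torch.multinomial during sampling, so checking the raw logits and trying greedy decoding tells you whether the model output itself is already broken. A rough sketch (my own debugging idea, reusing the model/tokenizer/input_ids names from the script above and assuming the AutoGPTQ wrapper forwards calls to the underlying model):

import torch

# 1) Inspect the raw logits: NaN/Inf here means the forward pass itself is
#    producing garbage (wrong kernel, bad device split, broken quant weights),
#    not just an unlucky sampling step.
with torch.no_grad():
    logits = model(input_ids).logits
print("any NaN:", torch.isnan(logits).any().item(),
      "any Inf:", torch.isinf(logits).any().item())

# 2) Greedy decoding skips torch.multinomial entirely; if this works while
#    sampling fails, the problem is only in the probability distribution.
output = model.generate(inputs=input_ids, do_sample=False, max_new_tokens=64)
print(tokenizer.decode(output[0]))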