astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit · Not able to generate answer from astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit

Hi!

Thank you for uploading this quantized model, as it allows me to use Llama 3 from Google Colab, as otherwise it's not possible because the original model is too big to fit in Nvidia T4.

I use the following code to load model and generate text:

from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"

print('Creating QuantizeConfig...')

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False
    )

print('Loading Quantized model...')

model = AutoGPTQForCausalLM.from_quantized(
        model_id,
        use_safetensors=True,
        device="cuda:0",
        quantize_config=quantize_config)

print('Loading Tokenizer model...')

tokenizer = AutoTokenizer.from_pretrained(model_id)

print('Creating Pipeline...')

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)

rompt = "What is the capital of Indonesia?"

terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt},
]

outputs = pipe(messages,
               max_new_tokens=256,
               eos_token_id=terminators,
               do_sample=True,
               temperature=0.6,
               top_p=0.9)
print(outputs[0]["generated_text"][-1])

However the output is:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '】,【\x00\x0c】,【\\",\\"gers\x00】,\x00`,{"gorithms`\n\n`.\n\n»\n\n»\n\n`,`},»\n\n`\n\n}\n\n».\n\n`]("><}[],](`\n\n],[`,]]["],"),\n\n」\n\n},{"]\n\n`,}\n\n}\n\n%),»\n\n],}\n\n`\n\n)</」\n\n```\n\n`,`,\n`\n\n`\n\n`.\n\n},{{"`\n\n`.\n\n`,>`{"{"},`,}\n\n</],\n\n{"],"}`>\n\n».\n\n».\n\n)`]\n\n%`,»{"}\n\n"`].\n\n`\n\n}\n\n])\n\n}`,`\n\n`\n\n«],{"`,`\n\n`\n\nAE)`]\n\n}\n\n``,»\n\n`,.\n\n`.\n\n`,`\n\n}\n\n>,"""\n\n]>`,{"`,)``,``»`,)\n\n\n`,{"`,{"}\n\n))\n\n]({"`,]]`\n\n\n\n\n\n\n\n\n\n%,``.\n\n\n\n\n`,\n\n\n`\n\n`\n\n.\n\n\n``\n\n`\n\n]\n\n`,]( "\n\n\n.\n\n\n#](}}`\n\n<<{"))\n\n%)`%\n\n\n.\n\n\n\n\n\n`]`,]]}\n\n.\n\n\n`\n\nD<<``][B»!\n\n>\n\n`\n\n`,]]¢]\n\n`\n\n\n\n``»\n\n\n`\n\n\n\n]]\n\n}\n\n%]][<``]\n\n.\n\n\n\n\n<<=`]`,{}`\n\n\n\n[/`\n\n¢}=)\n\n.\n\n}\n\n}\n\n\n\n](%}<AE}`\n\n``»>\n\n%G``\n\n\n\n\nAE\n\n\n]\n\n"""\n\n)<\n\n\n\n\n`\n\n]][](\n\n#%}>\n\n``>\n\n\n\n»`\n\n`\n\n'}

I used a different system message:

    {"role": "system", "content": "You are a helpful assistant."},

with similar garbled output:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '\x00】,【\x00`,\x00\x00\\"><},},{\x00`\n\n\\"],」\n\n`.\n\n»\n\n\\",\\"】,【},`](«】,【],[】,【»\n\n},`\n\n](}\n\n»\n\n}[`,``,},"></«"""\n\n]][».\n\n`.\n\n»,%),"`},`]\n\n`,`,»`,`,»`,"],"},]\n\n},}`>``,]{.\n\n`.\n\n},"\n\n].\n\n`\n\n],"},],\n\n],%,},{"%,]]`,`\n\n{}`\n\n%```,.\n\n»]\n\n`,,"»%`,},]\n\n»},.\n\n\n`,}\n\n`\n\n`.\n\n]\n\n},}%``\n\n\n.\n\n\n``.\n\nologists},>`\n\n\n``\n\n<<AE`,`\n\n`,.\n\n`,```,>,""``,»}](`\n\n€`,>\n\n{"F«{"`,}]\n\n`\n\n\n\n\n]\n\n)``\n\n>`»,``}}\n\n\n\n<<\n\n\n]]`.\n\n}\n\n`\n\nG\n\n\n`\n\n},{"]]},`\n\n`,]>`\n\n\n`AE`\n\n\n\n\nAE<<))\n\n\n\n`,<<\n\n<B "\n\n),»\n\n\n\n`}\n\n#F]AE`\n\n}\n\n`,\n\n\n</]][\n\n\n]][],<<],¢\n\n\n=`\n\n,{{\n\n\n}\n\n\n\n\n\n.\n\n<<\x00\n\n\n%]]\n\n\n\n\n\n\n»`A]][.\n\n\n\n\n]]\n\n)\n\n`\n\n\n\n"""\n\n\n\n]]\n\n</{"```\n\n\n@C\n\n\n`\n\n</]]`>\n\n\n\n\n\n\n\n\n'}

A simple string prompt:

# prompt = "What is a large language model?"
prompt = "What is the capital of Indonesia?"

terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipe(prompt,
               max_new_tokens=256,
               eos_token_id=terminators,
               do_sample=True,
               temperature=0.6,
               top_p=0.9)
print(outputs[0]["generated_text"][-1])

And a single output:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
s

Can you suggest where I did wrong?