My inference output keeps repeating. How do I make generation end at <|endoftext|>?
I used your code for inference and loaded the model successfully, but during inference the model keeps answering the same question over and over. I only need a single inference result. What should I do?
Load model:

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "falcon"
model_basename = "gptq_model-4bit--1g"
use_strict = False
use_triton = False

print("Loading tokenizer")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                           model_basename=model_basename,
                                           trust_remote_code=True,
                                           use_safetensors=True,
                                           strict=use_strict,
                                           torch_dtype=torch.float32,
                                           device="cuda:3",
                                           use_triton=use_triton)
```
Inference:

```python
# prompt_template holds the instruction prompt, e.g. "Instruction: 1+1\nResponse:"
input_ids = tokenizer(prompt_template, return_tensors='pt').to("cuda:3").input_ids
output = model.generate(inputs=input_ids,
                        temperature=0.01,
                        do_sample=True,
                        max_new_tokens=20)
response = tokenizer.decode(output[0])
```
Result:

```
Instruction: 1+1
Response:2<|endoftext|>The answer to 1+1 is 2.<|endoftext|>#1 is the first number
```

(I don't need everything after the first answer.)
I just need the answer (2). How do I truncate the model's generation after `<|endoftext|>`?

Thanks~
This issue helped me figure it out - https://github.com/huggingface/transformers/issues/22794#issuecomment-1598977285
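
For anyone else who hits this: one straightforward fix is to tell `generate()` which token ends a sequence, so decoding halts as soon as the model emits `<|endoftext|>` instead of running until `max_new_tokens`. A minimal sketch based on the snippets above, assuming the tokenizer's EOS token is `<|endoftext|>` (as it is for Falcon) and that `model`, `tokenizer`, and `prompt_template` are defined as earlier:

```python
# Stop decoding at the first <|endoftext|> instead of generating until max_new_tokens.
eos_id = tokenizer.eos_token_id  # for Falcon this is the id of <|endoftext|>

input_ids = tokenizer(prompt_template, return_tensors='pt').to("cuda:3").input_ids
output = model.generate(inputs=input_ids,
                        temperature=0.01,
                        do_sample=True,
                        max_new_tokens=20,
                        eos_token_id=eos_id,   # halt at the first <|endoftext|>
                        pad_token_id=eos_id)   # avoids the "pad_token_id not set" warning

# skip_special_tokens=True also drops <|endoftext|> from the decoded string
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
```

If you would rather clean up text you have already generated, you can also just cut the decoded string at the first marker: `response.split("<|endoftext|>")[0]`.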