My inference output keeps repeating. How do I make generation end at <|endoftext|>?
I used your code for inference and loaded the model successfully, but during inference the model keeps answering the same question over and over. I only need a single inference result. What should I do?
Load model:

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "falcon"
model_basename = "gptq_model-4bit--1g"
use_strict = False
use_triton = False

print("Loading tokenizer")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                           model_basename=model_basename,
                                           trust_remote_code=True,
                                           use_safetensors=True,
                                           strict=use_strict,
                                           torch_dtype=torch.float32,
                                           device="cuda:3",
                                           use_triton=use_triton)
```
Inference:

```python
# prompt_template holds the instruction prompt, e.g. "Instruction: 1+1\nResponse:"
input_ids = tokenizer(prompt_template, return_tensors='pt').to("cuda:3").input_ids
output = model.generate(inputs=input_ids,
                        temperature=0.01,
                        do_sample=True,
                        max_new_tokens=20)
response = tokenizer.decode(output[0])
```
Result:

```
Instruction: 1+1
Response:2<|endoftext|>The answer to 1+1 is 2.<|endoftext|>#1 is the first number
```

(I don't need everything after the first answer.)
I just need the answer (2). How do I truncate the model's generation after `<|endoftext|>`?

Thanks~
This issue helped me figure it out - https://github.com/huggingface/transformers/issues/22794#issuecomment-1598977285
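
For anyone else who hits this: one straightforward fix is to tell `generate()` which token ends a sequence, so decoding halts as soon as the model emits `<|endoftext|>` instead of running until `max_new_tokens`. A minimal sketch based on the snippets above, assuming the tokenizer's EOS token is `<|endoftext|>` (as it is for Falcon) and that `model`, `tokenizer`, and `prompt_template` are defined as earlier:

```python
# Stop decoding at the first <|endoftext|> instead of generating until max_new_tokens.
eos_id = tokenizer.eos_token_id  # for Falcon this is the id of <|endoftext|>

input_ids = tokenizer(prompt_template, return_tensors='pt').to("cuda:3").input_ids
output = model.generate(inputs=input_ids,
                        temperature=0.01,
                        do_sample=True,
                        max_new_tokens=20,
                        eos_token_id=eos_id,   # halt at the first <|endoftext|>
                        pad_token_id=eos_id)   # avoids the "pad_token_id not set" warning

# skip_special_tokens=True also drops <|endoftext|> from the decoded string
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
```

If you would rather clean up text you have already generated, you can also just cut the decoded string at the first marker: `response.split("<|endoftext|>")[0]`.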