Facing an error during inference
#14 · opened by sauravm8
I am facing an inference error: `The size of tensor a (2048) must match the size of tensor b (2049) at non-singleton dimension 3`.
I am truncating the input prompt well before 2048 tokens, at around 1250 words (~1500 tokens).
```python
import time

start_query_time = time.time()

final_message = f"<|prompter|>What do you think of the following and keep it under 100 words: \n {received_message}<|endoftext|><|assistant|>"
inputs = tokenizer(final_message, return_tensors="pt").to(model.device)
print(f"Number of tokens --> {inputs['input_ids'].shape[1]}")
tokens = model.generate(**inputs, max_new_tokens=1000, do_sample=True, temperature=0.8)
# str.strip() removes individual characters, not a substring,
# so trim the trailing end-of-text token explicitly instead
response = tokenizer.decode(tokens[0]).split("<|assistant|>")[1].removesuffix("<|endoftext|>").strip()
print(f"Total time taken is {time.time() - start_query_time}")
print(response)
```
GPU is not an issue. What is going on here?
I think I figured it out: the newly generated tokens also have to fit within the 2048-token limit, i.e. prompt tokens plus `max_new_tokens` must not exceed 2048. Otherwise, it crashes. Can it not use a sliding window of context instead? Am I missing something?
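For anyone hitting the same error: one way to avoid the crash is to cap `max_new_tokens` so that the prompt plus the generation never exceed the model's context window. Below is a minimal sketch, assuming `model`, `tokenizer`, and `final_message` are already defined as above; `max_position_embeddings` is the usual config attribute for Hugging Face causal LMs, but verify it for your specific model.

```python
# Minimal sketch: budget generation so prompt + new tokens fit the context window.
# Assumption: the model config exposes max_position_embeddings (common, not universal).
max_ctx = getattr(model.config, "max_position_embeddings", 2048)

inputs = tokenizer(
    final_message,
    return_tensors="pt",
    truncation=True,           # hard cap on the prompt itself
    max_length=max_ctx - 100,  # leave room for at least 100 new tokens
).to(model.device)

prompt_len = inputs["input_ids"].shape[1]
budget = max_ctx - prompt_len  # tokens that can still be generated safely

tokens = model.generate(
    **inputs,
    max_new_tokens=min(1000, budget),
    do_sample=True,
    temperature=0.8,
)
```

As for a streaming window: a model with learned absolute positional embeddings has no embeddings past position 2048, so generation genuinely cannot continue beyond that point; a sliding window would require architectural support (e.g. rotary or windowed attention), not just a generation flag.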