Falcon-7B-Instruct using CPU for inference even on NVIDIA A40 cards with 50GB VRAM
#70 · opened by Akshadv
```python
import torch
from transformers import AutoTokenizer, pipeline

# model_path is defined elsewhere in my code
tokenizer = AutoTokenizer.from_pretrained(model_path)
falcon_pipeline = pipeline(
    "text-generation",
    model=model_path,
    tokenizer=tokenizer,
    max_new_tokens=256,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    do_sample=True,
    top_k=10,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)
```
I'm using this code plus LLMChain for inference. Am I doing something wrong, or does anything need to be changed to run inference fully on the GPU? CPU usage is always hitting 100%.
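For anyone debugging the same symptom, a minimal diagnostic sketch (assuming `torch` and a pipeline built as above, here referred to by the name `falcon_pipeline` from my snippet): first confirm PyTorch can see the GPU at all, since a CPU-only torch build silently falls back to CPU even with `device_map="auto"`.

```python
import torch

# If this prints False, torch was installed without CUDA support and
# the pipeline will run entirely on CPU regardless of device_map.
cuda_ok = torch.cuda.is_available()
print("CUDA available:", cuda_ok)

if cuda_ok:
    # Report the visible devices so you can confirm the A40s are seen.
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:", torch.cuda.get_device_name(i))

# After building the pipeline, you can also inspect where Accelerate
# actually placed the model's layers (attribute name is an assumption
# based on models loaded with device_map):
#   print(falcon_pipeline.model.hf_device_map)
```

Note that some CPU load during generation is expected (tokenization and sampling run on CPU), but sustained 100% CPU with an idle GPU usually points to the model weights never reaching the GPU.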