Falcon-7B-Instruct using CPU for inference even on NVIDIA A40 cards with 50GB VRAM
#70 · opened by Akshadv
```python
import torch
from transformers import AutoTokenizer, pipeline

# model_path is defined elsewhere in my code
tokenizer = AutoTokenizer.from_pretrained(model_path)
falcon_pipeline = pipeline(
    "text-generation",
    model=model_path,
    tokenizer=tokenizer,
    max_new_tokens=256,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    do_sample=True,
    top_k=10,
    temperature=0.7,
    eos_token_id=tokenizer.eos_token_id,
)
```
I'm using this code plus LLMChain for inference. Am I doing something wrong, or does anything need to be changed to run inference fully on the GPU? CPU usage is always hitting 100%.
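For anyone debugging the same symptom, a minimal diagnostic sketch (assuming `torch` and a pipeline built as above, here referred to by the name `falcon_pipeline` from my snippet): first confirm PyTorch can see the GPU at all, since a CPU-only torch build silently falls back to CPU even with `device_map="auto"`.

```python
import torch

# If this prints False, torch was installed without CUDA support and
# the pipeline will run entirely on CPU regardless of device_map.
cuda_ok = torch.cuda.is_available()
print("CUDA available:", cuda_ok)

if cuda_ok:
    # Report the visible devices so you can confirm the A40s are seen.
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:", torch.cuda.get_device_name(i))

# After building the pipeline, you can also inspect where Accelerate
# actually placed the model's layers (attribute name is an assumption
# based on models loaded with device_map):
#   print(falcon_pipeline.model.hf_device_map)
```

Note that some CPU load during generation is expected (tokenization and sampling run on CPU), but sustained 100% CPU with an idle GPU usually points to the model weights never reaching the GPU.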