CPU seems to be the bottleneck preventing full use of the GPU
#211
opened by vangap
Hello,
I have Llama-3-8B-Instruct running on an L4 GPU (GCP VM). During inference, GPU utilization sits around 50%. Digging a little further, I noticed that one CPU core is at 100% throughout the inference, so I am guessing this is a bottleneck preventing full use of the GPU. CPU profiling shows that most of this CPU time is spent in libcuda. Is this normal, or is something wrong with my environment that is causing this behavior?
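A sketch of the kind of profiling run that surfaces this (assuming `torch.profiler`; `pipe` and `prompt` are placeholders for the pipeline and prompt built in the code below):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one short generation; `pipe` and `prompt` are placeholders for
# the pipeline and chat prompt constructed in the code below.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    pipe(prompt, max_new_tokens=32)

# Sort by CPU time: heavy cudaLaunchKernel entries suggest the core is
# mostly busy launching kernels rather than doing real work.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```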
Below is my code:

```python
import torch
from transformers import pipeline

self.pipeline = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

prompt = self.pipeline.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Stop on either the model's EOS token or Llama 3's end-of-turn token.
terminators = [
    self.pipeline.tokenizer.eos_token_id,
    self.pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = self.pipeline(
    prompt,
    max_new_tokens=max_length,
    eos_token_id=terminators,
    do_sample=False,  # greedy decoding; temperature/top_p would be ignored here
)
```
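One thing I may try, assuming torch >= 2.0 is available: compiling the model, which can trim the per-token Python/launch overhead (sketch only; the first call is slow while compilation runs):

```python
# Sketch, assuming torch >= 2.0: torch.compile may reduce the per-token
# CPU overhead of launching many small kernels during decoding.
self.pipeline.model = torch.compile(self.pipeline.model)
```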
Just out of curiosity, which version of torch are you using? Mine is not CUDA-enabled, and I am being told to recompile from source to enable CUDA.
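For reference, the build can be checked without recompiling anything:

```python
import torch

print(torch.__version__)          # e.g. "2.3.0+cu121"; a "+cpu" suffix means a CPU-only build
print(torch.cuda.is_available())  # False on a CPU-only build or with no working driver
print(torch.version.cuda)         # CUDA version the wheel was built against; None on CPU-only builds
```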