Question: Maximizing GPU Utilization for Inference
#24 · opened by ric1732
I have a code snippet for text generation using Hugging Face's Transformers library, running inference on a machine with 8 GPUs. However, during inference only 2 or 3 GPUs are active at any given time, and GPU utilization stays below 32%. I want to optimize my code to use the full power of all 8 available GPUs.
Here is the code I am currently using:
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
import math

batch_size = 32

if __name__ == '__main__':
    txt_list = [
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
        "The story so far: in the beginning, the universe was created.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "I write this sitting in the kitchen sink.",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
    ] * 500
    lf = len(txt_list)

    tokenizer = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-v0.1')
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "left"  # left-pad so generation continues right after each prompt
    model = AutoModelForCausalLM.from_pretrained('mistralai/Mixtral-8x7B-v0.1', device_map='auto')

    out_list = []
    n_steps = math.ceil(lf / batch_size)
    for btx in range(n_steps):
        t_sens = txt_list[btx * batch_size:(btx + 1) * batch_size]
        t_toks = tokenizer(t_sens, return_tensors='pt', padding=True).to('cuda')
        opt = model.generate(**t_toks, max_new_tokens=200)
        # Iterate over the actual batch, not batch_size: the final batch is smaller.
        for jty in range(len(t_sens)):
            ctxt = tokenizer.decode(opt[jty], skip_special_tokens=True)
            ctxt = ctxt[len(t_sens[jty]):].strip()  # drop the echoed prompt
            out_list.append({'input': t_sens[jty], 'output': ctxt})

    str_list = [json.dumps(xx) for xx in out_list]
    with open('rrr', 'w') as otf:
        otf.write('\n'.join(str_list))
The code runs, but at most 3 GPUs ever do work, so the machine is nowhere near its full throughput. How can I modify the code to get maximum GPU utilization during the inference step?
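For what it's worth, the only idea I have come up with so far is to shard the prompts across independent model replicas, one replica per pair of GPUs, on the theory that device_map='auto' in a single process splits the layers across GPUs and runs them one after another (which would explain why only a couple of GPUs are busy at any moment). Below is an untested sketch of that idea; it assumes a Mixtral replica fits on two of my GPUs in float16, and the worker count, sharding scheme, and per-rank output file names are just placeholders:

import os
import json
import math

import torch.multiprocessing as mp


def run_shard(rank, world_size, txt_list, batch_size):
    # Pin this replica to its own GPU pair *before* CUDA is initialized,
    # so device_map='auto' only ever sees these two devices.
    os.environ["CUDA_VISIBLE_DEVICES"] = f"{2 * rank},{2 * rank + 1}"
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-v0.1')
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "left"
    model = AutoModelForCausalLM.from_pretrained(
        'mistralai/Mixtral-8x7B-v0.1', device_map='auto',
        torch_dtype=torch.float16)

    shard = txt_list[rank::world_size]  # round-robin shard of the prompts
    out_list = []
    for btx in range(math.ceil(len(shard) / batch_size)):
        t_sens = shard[btx * batch_size:(btx + 1) * batch_size]
        t_toks = tokenizer(t_sens, return_tensors='pt', padding=True).to('cuda')
        opt = model.generate(**t_toks, max_new_tokens=200)
        for jty in range(len(t_sens)):
            ctxt = tokenizer.decode(opt[jty], skip_special_tokens=True)
            out_list.append({'input': t_sens[jty],
                             'output': ctxt[len(t_sens[jty]):].strip()})

    # One output file per replica (placeholder naming); concatenate afterwards.
    with open(f'rrr.{rank}', 'w') as otf:
        otf.write('\n'.join(json.dumps(xx) for xx in out_list))


if __name__ == '__main__':
    txt_list = ["The King is dead. Long live the Queen."] * 5000  # same prompts as above
    batch_size = 32
    world_size = 4  # 8 GPUs / 2 GPUs per replica
    mp.spawn(run_shard, args=(world_size, txt_list, batch_size), nprocs=world_size)

Even if this works, it loads four copies of the weights, so I am not sure it is the intended way to saturate all 8 GPUs with Transformers. Any pointers would be appreciated.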
Thank you!