Inquiry about Generation Speed

#17
by Boyue27 - opened

I've been experiencing issues with generation speed recently and was wondering if anyone else has run into something similar. Generation seems slower than usual. Here is my setup:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "mixtral"
# max_memory_mapping is defined earlier in my script
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", max_memory=max_memory_mapping
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# ... prompt tokenization elided ...

output_ids = model.generate(
    input_ids=input_ids.cuda(),
    do_sample=True,
    temperature=0.4,
    top_k=50,
    max_new_tokens=300,
)

Same here, the generation is very slow.

If the model is offloaded to the CPU, then of course it's going to be slow :/ The model itself did not change, unless you are computing the loss (which was not working on parallel devices). Also make sure output_router_logits is set to False in the config.
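As a rough sketch of what that config check could look like at load time (assuming the Mixtral config exposes output_router_logits, and reusing the model_path from the snippet above):

from transformers import AutoConfig, AutoModelForCausalLM

model_path = "mixtral"
config = AutoConfig.from_pretrained(model_path)
# Disable router logits so no auxiliary loss is computed during generation
config.output_router_logits = False

model = AutoModelForCausalLM.from_pretrained(
    model_path, config=config, device_map="auto"
)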

@Boyue27 your model is most likely offloaded to CPU or disk, as Arthur said. You need to load the model in half precision or 4-bit precision to make sure it fits on your GPU device:

For float16:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "mixtral"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", max_memory=max_memory_mapping, torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

For 4-bit precision (after installing bitsandbytes with pip install bitsandbytes):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "mixtral"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", max_memory=max_memory_mapping, load_in_4bit=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
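To double-check that nothing was offloaded after loading with device_map="auto", you can also inspect the device map built by accelerate (a small sketch; hf_device_map is set on the model when device_map is used):

# See where each module ended up (GPU index, "cpu", or "disk")
print(model.hf_device_map)

# Anything placed on CPU or disk will slow generation down considerably
offloaded = {name: dev for name, dev in model.hf_device_map.items() if dev in ("cpu", "disk")}
if offloaded:
    print("Offloaded modules:", offloaded)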

@ybelkada Thank you for your help. I have tested your code and it fixed the problem.

@ArthurZ Thank you for your help; the solution is working great for me.
