DeepSpeed inference tensor parallelism: per-GPU memory footprint doesn't decrease as tp_size increases.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token # Set pad_token to eos_token
model.eval()
ds_engine = deepspeed.init_inference(model,
tensor_parallel={"tp_size": world_size},
#dtype=torch.float32,
dtype=torch.float16,
replace_with_kernel_inject=True)
model = ds_engine.module
model.eval()
generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)
return generator
According to the official tutorial https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/, the memory allocated on each GPU should decrease as the tp size increases. However, for the Mistral model it doesn't. I tested GPT-J 6B and LLaMA 7B with exactly the same code, and both work as expected (per-GPU memory allocation decreases as tp size increases). What's wrong with the Mistral model? I also tried DeepSpeed-MII, and the conclusion is the same.
Can anyone shed some light on this?
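For what it's worth, here is roughly how the per-GPU footprint can be checked right after init_inference (a minimal sketch using torch.cuda memory stats; the helper name is made up and the local_rank argument is just taken from the snippet above, not part of any DeepSpeed API):

# Sketch: print per-rank GPU memory after deepspeed.init_inference.
# Assumes `local_rank` is this process's GPU index, as in the snippet above.
import torch

def report_gpu_memory(local_rank):
    torch.cuda.synchronize(local_rank)
    allocated_gib = torch.cuda.memory_allocated(local_rank) / 1024 ** 3
    reserved_gib = torch.cuda.memory_reserved(local_rank) / 1024 ** 3
    print(f"[rank {local_rank}] allocated: {allocated_gib:.2f} GiB, reserved: {reserved_gib:.2f} GiB")

# With working tensor parallelism, the allocated number should shrink roughly
# as 1/tp_size per rank; in the Mistral case described above it does not.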
Mistral models are not supported by the old inference engine, so you should try the latest inference engine, DeepSpeed-MII. Here's an example of running a Mistral model:
import mii
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=128)
print(response)
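Note that the pipeline example above runs on a single GPU as written. My understanding from the DeepSpeed-MII README is that tensor parallelism with the non-persistent pipeline is controlled by the launcher, i.e. you run the same script with the deepspeed launcher and the model is sharded across the visible GPUs. A sketch of that, with a per-rank memory check added (the launch command, script name, and LOCAL_RANK usage are my assumptions, not something verified in this thread):

# Sketch: same pipeline, launched for tensor parallelism with (assumed):
#   deepspeed --num_gpus 2 mii_pipeline_tp.py
import os
import torch
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
response = pipe(["DeepSpeed is"], max_new_tokens=128)

# Per-rank footprint check (same idea as the earlier sketch);
# the deepspeed launcher sets LOCAL_RANK for each process.
rank = int(os.getenv("LOCAL_RANK", "0"))
print(f"[rank {rank}] allocated: {torch.cuda.memory_allocated(rank) / 1024 ** 3:.2f} GiB")
print(response)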
Under the hood, MII is still backed by the DeepSpeed inference engine. I tried it, and the result is the same: no memory footprint reduction as tensor parallelism increases, using the following code.
import argparse
import mii

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
    parser.add_argument("--tensor-parallel", type=int, default=1)
    args = parser.parse_args()
    # Start a persistent MII deployment sharded across the requested GPUs
    mii.serve(args.model, tensor_parallel=args.tensor_parallel)
    print(f"Serving model {args.model} on {args.tensor_parallel} GPU(s).")
    print(f"Run `python client.py --model {args.model}` to connect.")
    print(f"Run `python terminate.py --model {args.model}` to terminate.")

if __name__ == "__main__":
    main()
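For reference, the client.py and terminate.py mentioned in those prints look roughly like this (a sketch following the mii.client / generate / terminate_server calls from the MII README; not the exact scripts I ran):

# client.py -- rough sketch of a client for the persistent deployment above
import argparse
import mii

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
    args = parser.parse_args()
    # Connect to the deployment started by mii.serve
    client = mii.client(args.model)
    response = client.generate(["DeepSpeed is"], max_new_tokens=128)
    print(response)
    # terminate.py would instead call: client.terminate_server()

if __name__ == "__main__":
    main()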
I have the same problem, have you solved it?
Not yet. I don't think DeepSpeed supports tensor parallelism for Mistral.
Can we quantize the model and run it with DeepSpeed-MII?