Inference taking too much time
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/mixtral-8x7b-v0.1-AWQ", quantization="awq", dtype="auto", max_context_len_to_capture=2048, tensor_parallel_size=4, gpu_memory_utilization=0.7, swap_space=16)
sampling_params = SamplingParams(temperature=0.1, top_p=0.15, max_tokens=400)
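The generate call itself isn't shown above; as a minimal sketch, a single prompt is run and timed roughly like this (prompt is a placeholder, not the actual input):

import time

prompt = "..."  # placeholder for the actual prompt text
start = time.perf_counter()
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
print(f"latency: {time.perf_counter() - start:.1f} s")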
The prompt basically asks the model to generate answers.
This runs on 4 A10G GPUs (a g5.12xlarge instance), but the inference time is 9 seconds for a single prompt. That is about 3x the time of Mixtral-8x7B GGUF Q4_K_M under llama-cpp-python.
I was hoping vLLM would be faster. Any suggestions?
@tariksetia it's fast, but it shines with batching, not single prompts.
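For example, passing a list of prompts to one generate call lets vLLM batch them internally (a sketch reusing the llm and sampling_params from above; the prompts list is a placeholder):

prompts = ["prompt 1", "prompt 2", "prompt 3"]  # many requests submitted together
outputs = llm.generate(prompts, sampling_params)  # vLLM schedules these as a batch
for out in outputs:
    print(out.outputs[0].text)

The per-prompt cost drops as the batch grows, which is where vLLM's continuous batching pays off; single-prompt latency is not where it stands out.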
If you can fit it on GPU, the fastest is ExLlamaV2 or ExLlama.
Hi!
We are comparing throughput for Mixtral AWQ and Mixtral GPTQ on vLLM. Settings: A100 80GB, batch size = 1, input_len = 256.
We found that GPTQ is 2x faster than AWQ. Our a priori expectation was the opposite.
Do you have any idea why AWQ is slower?
Thanks,
@cristian-rodriguez AWQ is usually not very fast at batch size 1, which is what you are running. Even at larger batch sizes, I believe other quantization formats are slightly faster.
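One way to check this directly is to time both quantized checkpoints at a few batch sizes (a sketch; the GPTQ repo name and the prompt list are assumptions, not taken from the thread):

import time
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["..."] * 8  # placeholder prompts, roughly 256 input tokens each

for name, quant in [("TheBloke/mixtral-8x7b-v0.1-AWQ", "awq"),
                    ("TheBloke/Mixtral-8x7B-v0.1-GPTQ", "gptq")]:  # GPTQ repo name is an assumption
    # in practice, run each model in its own process so GPU memory is fully released between runs
    llm = LLM(model=name, quantization=quant, dtype="auto")
    for bs in (1, 4, 8):
        start = time.perf_counter()
        outs = llm.generate(prompts[:bs], params)
        elapsed = time.perf_counter() - start
        toks = sum(len(o.outputs[0].token_ids) for o in outs)
        print(f"{quant} bs={bs}: {toks / elapsed:.1f} generated tok/s")

Comparing the tok/s numbers across batch sizes should show whether the AWQ vs GPTQ gap you see at batch size 1 narrows once the kernels are fed larger batches.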