Inference taking too much time
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/mixtral-8x7b-v0.1-AWQ", quantization="awq", dtype="auto", max_context_len_to_capture=2048, tensor_parallel_size=4, gpu_memory_utilization=0.7, swap_space=16)
sampling_params = SamplingParams(temperature=0.1, top_p=0.15, max_tokens=400)
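The generate call itself isn't shown above; as a minimal sketch, a single prompt is run and timed roughly like this (prompt is a placeholder, not the actual input):

import time

prompt = "..."  # placeholder for the actual prompt text
start = time.perf_counter()
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
print(f"latency: {time.perf_counter() - start:.1f} s")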
The prompt basically asks the model to generate answers.
This runs on 4 A10G GPUs (a g5.12xlarge instance), but the inference time is 9 seconds for a single prompt. That is about 3x the time of Mixtral-8x7B GGUF Q4_K_M under llama-cpp-python.
I was hoping vLLM would be faster. Any suggestions?
@tariksetia it's fast, but it shines with batching, not single prompts.
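For example, passing a list of prompts to one generate call lets vLLM batch them internally (a sketch reusing the llm and sampling_params from above; the prompts list is a placeholder):

prompts = ["prompt 1", "prompt 2", "prompt 3"]  # many requests submitted together
outputs = llm.generate(prompts, sampling_params)  # vLLM schedules these as a batch
for out in outputs:
    print(out.outputs[0].text)

The per-prompt cost drops as the batch grows, which is where vLLM's continuous batching pays off; single-prompt latency is not where it stands out.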
If you can fit it on GPU, the fastest is ExLlamaV2 or ExLlama.
Hi!
We are comparing throughput for Mixtral AWQ and Mixtral GPTQ on vLLM. Settings: A100 80GB, batch size = 1, input_len = 256.
We found that GPTQ is 2x faster than AWQ. Our a priori expectation was the opposite.
Do you have any idea why AWQ is slower?
Thanks,
@cristian-rodriguez AWQ is usually not very fast at batch size 1, which is what you are running. Even at larger batch sizes, I believe other quantization formats are slightly faster.
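One way to check this directly is to time both quantized checkpoints at a few batch sizes (a sketch; the GPTQ repo name and the prompt list are assumptions, not taken from the thread):

import time
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = ["..."] * 8  # placeholder prompts, roughly 256 input tokens each

for name, quant in [("TheBloke/mixtral-8x7b-v0.1-AWQ", "awq"),
                    ("TheBloke/Mixtral-8x7B-v0.1-GPTQ", "gptq")]:  # GPTQ repo name is an assumption
    # in practice, run each model in its own process so GPU memory is fully released between runs
    llm = LLM(model=name, quantization=quant, dtype="auto")
    for bs in (1, 4, 8):
        start = time.perf_counter()
        outs = llm.generate(prompts[:bs], params)
        elapsed = time.perf_counter() - start
        toks = sum(len(o.outputs[0].token_ids) for o in outs)
        print(f"{quant} bs={bs}: {toks / elapsed:.1f} generated tok/s")

Comparing the tok/s numbers across batch sizes should show whether the AWQ vs GPTQ gap you see at batch size 1 narrows once the kernels are fed larger batches.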