yangapku committed
Commit 49c7a07
1 Parent(s): dec217a

update content of vllm gptq model

Files changed (1): README.md (+22 -7)
README.md CHANGED
@@ -102,12 +102,17 @@ Using vLLM for inference can support longer context lengths and obtain at least
 If you use cuda 12.1 and pytorch 2.1, you can directly use the following command to install vLLM.
 
 ```bash
-pip install vllm
+# pip install vllm  # This is faster, but it does not support quantized models.
+
+# The lines below support int4 quantization (int8 will be supported soon); installation is slower (~10 minutes).
+git clone https://github.com/QwenLM/vllm-gptq
+cd vllm-gptq
+pip install -e .
 ```
 
-否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
+否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html),或者我们的[vLLM分支仓库(支持量化模型)](https://github.com/QwenLM/vllm-gptq)。
 
-Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html), or to our [vLLM fork with GPTQ quantization support](https://github.com/QwenLM/vllm-gptq).
 <br>
 
 ## 快速使用(Quickstart)
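After either install path in the hunk above, a quick import check confirms that the build is usable before moving on. This is a minimal sketch, not part of the commit's README content; `vllm.__version__` is the package's standard version attribute, and the printed value will differ between upstream vLLM and the QwenLM/vllm-gptq fork.

```python
# Sanity check: vLLM imports and reports its version after installation.
import vllm

print(vllm.__version__)
```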
@@ -182,6 +187,7 @@ After installing vLLM according to the dependency section above, you can download
 from vllm_wrapper import vLLMWrapper
 
 model = vLLMWrapper('Qwen/Qwen-72B-Chat', tensor_parallel_size=2)
+# model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")  # 运行int4模型。 Run the int4 model.
 
 response, history = model.chat(query="你好", history=None)
 print(response)
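The hunk above shows a single-turn call; `model.chat` also returns the running history, which can be passed back in for multi-turn use. Below is a minimal sketch along those lines, assuming the `vllm_wrapper.py` helper from the Qwen repo is importable and mirroring the commented-out Int4 line in the diff:

```python
from vllm_wrapper import vLLMWrapper

# Int4 checkpoint on a single GPU, as in the commented example above (float16 matches the GPTQ weights).
model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")

# Multi-turn chat: feed the returned history back into the next call.
history = None
for query in ("你好", "给我讲一个年轻人奋斗创业最终取得成功的故事。"):
    response, history = model.chat(query=query, history=history)
    print(response)
```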
@@ -204,6 +210,7 @@ If deploying with 2xA100-80G, you can run the following code:
 ```python
 python -m fastchat.serve.controller
 python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat --trust-remote-code --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat-Int4 --trust-remote-code --dtype float16  # 运行int4模型。 Run the int4 model.
 python -m fastchat.serve.openai_api_server --host localhost --port 8000
 ```
 
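Once the three FastChat processes above are running, `openai_api_server` exposes an OpenAI-compatible endpoint on port 8000. A minimal client sketch follows, assuming the `openai` Python package (v1+) and that FastChat registers the model under the basename of `--model-path`; list the registered models first if the name differs.

```python
from openai import OpenAI

# FastChat's OpenAI-compatible server; the API key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Inspect the registered model names if unsure what to pass below.
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen-72B-Chat",  # assumed: basename of --model-path
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)
```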
 
@@ -260,9 +267,9 @@ model = AutoModelForCausalLM.from_pretrained(
 response, history = model.chat(tokenizer, "你好", history=None)
 ```
 
-注意:vLLM暂不支持gptq量化方案,我们将近期给出解决方案。
+注意:使用vLLM运行量化模型需安装我们的[vLLM分支仓库](https://github.com/QwenLM/vllm-gptq)。暂不支持int8模型,近期将更新。
 
-Note: vLLM does not currently support the `gptq` quantization, and we will provide a solution in the near future.
+Note: to run GPTQ-quantized models with vLLM, you need to install our [vLLM fork](https://github.com/QwenLM/vllm-gptq). The int8 model is not supported for the time being; we will add support for it soon.
 
 ### 效果评测
 
@@ -307,12 +314,20 @@ We measured the average inference speed and GPU memory usage of generating 2048
 | Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
 | Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
 | Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
+| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
+| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
 | Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
 | Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
 | Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
-| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 85.99GB |
-| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 85.99GB |
+| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
+| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
+| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
 | Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
+| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
 
 \* vLLM会提前预分配显存,因此无法探测最大显存使用情况。HF是指使用Huggingface Transformers库进行推理。
 
 