yangapku committed
Commit 49c7a07
1 Parent(s): dec217a

update content of vllm gptq model

Files changed (1): README.md (+22 -7)
README.md CHANGED
@@ -102,12 +102,17 @@ Using vLLM for inference can support longer context lengths and obtain at least
 If you use cuda 12.1 and pytorch 2.1, you can directly use the following command to install vLLM.
 
 ```bash
-pip install vllm
+# pip install vllm  # This is faster, but it does not support quantized models.
+
+# The lines below support int4 quantization (int8 will be supported soon); installation is slower (~10 minutes).
+git clone https://github.com/QwenLM/vllm-gptq
+cd vllm-gptq
+pip install -e .
 ```
 
-否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
+否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html),或者我们的[vLLM分支仓库(支持量化模型)](https://github.com/QwenLM/vllm-gptq)。
 
-Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html), or to our [vLLM fork with GPTQ quantization support](https://github.com/QwenLM/vllm-gptq).
 <br>
 
 ## 快速使用(Quickstart)
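After either install path in the hunk above, a quick import check confirms that the build is usable before moving on. This is a minimal sketch, not part of the commit's README content; `vllm.__version__` is the package's standard version attribute, and the printed value will differ between upstream vLLM and the QwenLM/vllm-gptq fork.

```python
# Sanity check: vLLM imports and reports its version after installation.
import vllm

print(vllm.__version__)
```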
@@ -182,6 +187,7 @@ After installing vLLM according to the dependency section above, you can download
 from vllm_wrapper import vLLMWrapper
 
 model = vLLMWrapper('Qwen/Qwen-72B-Chat', tensor_parallel_size=2)
+# model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")  # 运行int4模型。 Run the int4 model.
 
 response, history = model.chat(query="你好", history=None)
 print(response)
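The hunk above shows a single-turn call; `model.chat` also returns the running history, which can be passed back in for multi-turn use. Below is a minimal sketch along those lines, assuming the `vllm_wrapper.py` helper from the Qwen repo is importable and mirroring the commented-out Int4 line in the diff:

```python
from vllm_wrapper import vLLMWrapper

# Int4 checkpoint on a single GPU, as in the commented example above (float16 matches the GPTQ weights).
model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")

# Multi-turn chat: feed the returned history back into the next call.
history = None
for query in ("你好", "给我讲一个年轻人奋斗创业最终取得成功的故事。"):
    response, history = model.chat(query=query, history=history)
    print(response)
```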
@@ -204,6 +210,7 @@ If deploying with 2xA100-80G, you can run the following code:
 ```python
 python -m fastchat.serve.controller
 python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat --trust-remote-code --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat-Int4 --trust-remote-code --dtype float16  # 运行int4模型。 Run the int4 model.
 python -m fastchat.serve.openai_api_server --host localhost --port 8000
 ```
 
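Once the three FastChat processes above are running, `openai_api_server` exposes an OpenAI-compatible endpoint on port 8000. A minimal client sketch follows, assuming the `openai` Python package (v1+) and that FastChat registers the model under the basename of `--model-path`; list the registered models first if the name differs.

```python
from openai import OpenAI

# FastChat's OpenAI-compatible server; the API key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Inspect the registered model names if unsure what to pass below.
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen-72B-Chat",  # assumed: basename of --model-path
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)
```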
 
@@ -260,9 +267,9 @@ model = AutoModelForCausalLM.from_pretrained(
 response, history = model.chat(tokenizer, "你好", history=None)
 ```
 
-注意:vLLM暂不支持gptq量化方案,我们将近期给出解决方案。
+注意:使用vLLM运行量化模型需安装我们的[vLLM分支仓库](https://github.com/QwenLM/vllm-gptq)。暂不支持int8模型,近期将更新。
 
-Note: vLLM does not currently support the `gptq` quantization, and we will provide a solution in the near future.
+Note: to run GPTQ-quantized models with vLLM, you need to install our [vLLM fork](https://github.com/QwenLM/vllm-gptq). The int8 model is not supported for the time being; we will add support for it soon.
 
 ### 效果评测
 
@@ -307,12 +314,20 @@ We measured the average inference speed and GPU memory usage of generating 2048
 | Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
 | Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
 | Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
+| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
+| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
 | Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
 | Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
 | Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
-| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 85.99GB |
-| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 85.99GB |
+| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
+| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
+| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
 | Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
+| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
 
 \* vLLM会提前预分配显存,因此无法探测最大显存使用情况。HF是指使用Huggingface Transformers库进行推理。
 
 