update content of vllm gptq model
README.md CHANGED
@@ -102,12 +102,17 @@ Using vLLM for inference can support longer context lengths and obtain at least
 If you use cuda 12.1 and pytorch 2.1, you can directly use the following command to install vLLM.
 
 ```bash
-pip install vllm
+# pip install vllm  # This is faster to install, but it does not support quantized models.
+
+# The lines below support int4 quantization (int8 will be supported soon). The installation is slower (~10 minutes).
+git clone https://github.com/QwenLM/vllm-gptq
+cd vllm-gptq
+pip install -e .
 ```
 
-否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html)。
+否则请参考vLLM官方的[安装说明](https://docs.vllm.ai/en/latest/getting_started/installation.html),或者我们的[vLLM分支仓库(支持量化模型)](https://github.com/QwenLM/vllm-gptq)。
 
-Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+Otherwise, please refer to the official vLLM [Installation Instructions](https://docs.vllm.ai/en/latest/getting_started/installation.html), or our [vLLM fork with GPTQ quantization support](https://github.com/QwenLM/vllm-gptq).
 <br>
 
 ## 快速使用(Quickstart)
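A quick sanity check after the editable install, outside the upstream instructions: importing the package confirms that `vllm` now resolves to the cloned fork. This is a minimal sketch; the attributes used are standard Python package metadata, not anything specific to vllm-gptq.

```python
# Minimal post-install check (not part of the upstream README).
import vllm

print(vllm.__version__)  # version reported by the installed package
print(vllm.__file__)     # for an editable install, this should point into the cloned vllm-gptq checkout
```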
@@ -182,6 +187,7 @@ After installing vLLM according to the dependency section above, you can download
 from vllm_wrapper import vLLMWrapper
 
 model = vLLMWrapper('Qwen/Qwen-72B-Chat', tensor_parallel_size=2)
+# model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")  # 运行int4模型。 run int4 model.
 
 response, history = model.chat(query="你好", history=None)
 print(response)
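The wrapper call above returns the updated history, so multi-turn chat simply feeds it back in on the next call. A minimal sketch of that pattern, reusing the int4 single-GPU configuration from the commented line (the follow-up query is illustrative):

```python
from vllm_wrapper import vLLMWrapper

# Assumes the int4 checkpoint runs on a single GPU, as in the commented-out line above.
model = vLLMWrapper('Qwen/Qwen-72B-Chat-Int4', tensor_parallel_size=1, dtype="float16")

# First turn: start with no history.
response, history = model.chat(query="你好", history=None)
print(response)

# Second turn: pass the returned history back to keep the conversation context.
response, history = model.chat(query="请再用英文回答一次。", history=history)
print(response)
```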
@@ -204,6 +210,7 @@ If deploying with 2xA100-80G, you can run the following code:
 ```python
 python -m fastchat.serve.controller
 python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat --trust-remote-code --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --dtype bfloat16
+# python -m fastchat.serve.vllm_worker --model-path Qwen/Qwen-72B-Chat-Int4 --trust-remote-code --dtype float16  # 运行int4模型。 run int4 model.
 python -m fastchat.serve.openai_api_server --host localhost --port 8000
 ```
 
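Once the three FastChat processes above are running, the server on port 8000 speaks the OpenAI chat-completions protocol, so any OpenAI-style client can call it. A minimal sketch with `requests`; the served model name is assumed to be `Qwen-72B-Chat` (FastChat normally derives it from the model path, so adjust it if your worker registers a different name):

```python
import requests

# Assumptions: fastchat.serve.openai_api_server from the block above is listening on
# localhost:8000 and the worker registered the model as "Qwen-72B-Chat".
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen-72B-Chat",
        "messages": [{"role": "user", "content": "你好"}],
        "temperature": 0.7,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```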
@@ -260,9 +267,9 @@ model = AutoModelForCausalLM.from_pretrained(
 response, history = model.chat(tokenizer, "你好", history=None)
 ```
 
-
+注意:使用vLLM运行量化模型需安装我们的[vLLM分支仓库](https://github.com/QwenLM/vllm-gptq)。暂不支持int8模型,近期将更新。
 
-Note: vLLM
+Note: To run quantized models with vLLM, you need to install our [vLLM fork](https://github.com/QwenLM/vllm-gptq). The int8 model is not supported for the time being; support will be added soon.
 
 ### 效果评测
 
@@ -307,12 +314,20 @@ We measured the average inference speed and GPU memory usage of generating 2048
 | Int4 | HF + FlashAttn-v2 | 1 | 1 | 2048 | 11.67 | 48.86GB |
 | Int4 | HF + FlashAttn-v1 | 1 | 1 | 2048 | 11.27 | 48.86GB |
 | Int4 | HF + No FlashAttn | 1 | 1 | 2048 | 11.32 | 48.86GB |
+| Int4 | vLLM | 1 | 1 | 2048 | 14.63 | Pre-Allocated* |
+| Int4 | vLLM | 2 | 1 | 2048 | 20.76 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 1 | 2048 | 27.19 | Pre-Allocated* |
 | Int4 | HF + FlashAttn-v2 | 2 | 6144 | 2048 | 6.75 | 85.99GB |
 | Int4 | HF + FlashAttn-v1 | 2 | 6144 | 2048 | 6.32 | 85.99GB |
 | Int4 | HF + No FlashAttn | 2 | 6144 | 2048 | 5.97 | 88.30GB |
-| Int4 |
-| Int4 |
+| Int4 | vLLM | 2 | 6144 | 2048 | 18.07 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 6144 | 2048 | 24.56 | Pre-Allocated* |
+| Int4 | HF + FlashAttn-v2 | 3 | 14336 | 2048 | 4.18 | 148.73GB |
+| Int4 | HF + FlashAttn-v1 | 3 | 14336 | 2048 | 3.72 | 148.73GB |
 | Int4 | HF + No FlashAttn | 3 | 14336 | 2048 | OOM | OOM |
+| Int4 | vLLM | 2 | 14336 | 2048 | 14.51 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 14336 | 2048 | 19.28 | Pre-Allocated* |
+| Int4 | vLLM | 4 | 30720 | 2048 | 16.93 | Pre-Allocated* |
 
 \* vLLM会提前预分配显存,因此无法探测最大显存使用情况。HF是指使用Huggingface Transformers库进行推理。
 