JustinLin610 committed • Commit ea53161 • 1 Parent(s): 8629f2a
Update README.md
README.md CHANGED
@@ -61,9 +61,11 @@ To run Qwen2, you can use `llama-cli` (the previous `main`) or `llama-server` (t
 We recommend using the `llama-server` as it is simple and compatible with OpenAI API. For example:
 
 ```bash
-./llama-server -m qwen2-72b-instruct-q4_0.gguf
+./llama-server -m qwen2-72b-instruct-q4_0.gguf -ngl 80 -fa
 ```
 
+(Note: `-ngl 80` refers to offloading 80 layers to GPUs, and `-fa` refers to the use of flash attention.)
+
 Then it is easy to access the deployed service with OpenAI API:
 
 ```python
@@ -91,7 +93,7 @@ If you choose to use `llama-cli`, pay attention to the removal of `-cml` for the
 -n 512 -co -i -if -f prompts/chat-with-qwen.txt \
 --in-prefix "<|im_start|>user\n" \
 --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
--ngl
+-ngl 80 -fa
 ```
 
 ## Evaluation
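For context: the ` ```python ` line in the first hunk opens the README's OpenAI-client example, which the diff context cuts off. A minimal sketch of such a call, assuming `llama-server` is listening on its default port 8080 and the `openai` Python package is installed (the model name and prompts here are illustrative, not taken from the README):

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API, by default at http://localhost:8080.
# The API key is not validated by llama-server, but the client requires a non-empty one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="qwen2-72b-instruct",  # illustrative; llama-server serves whatever GGUF it loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
)
print(completion.choices[0].message.content)
```

Any OpenAI-compatible client works the same way against this endpoint, since the server only needs the base URL pointed at `/v1`.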