JustinLin610 committed • Commit ea53161 • 1 Parent(s): 8629f2a
Update README.md
README.md CHANGED
@@ -61,9 +61,11 @@ To run Qwen2, you can use `llama-cli` (the previous `main`) or `llama-server` (t
 We recommend using the `llama-server` as it is simple and compatible with OpenAI API. For example:
 
 ```bash
-./llama-server -m qwen2-72b-instruct-q4_0.gguf
+./llama-server -m qwen2-72b-instruct-q4_0.gguf -ngl 80 -fa
 ```
 
+(Note: `-ngl 80` refers to offloading 80 layers to GPUs, and `-fa` refers to the use of flash attention.)
+
 Then it is easy to access the deployed service with OpenAI API:
 
 ```python
@@ -91,7 +93,7 @@ If you choose to use `llama-cli`, pay attention to the removal of `-cml` for the
 -n 512 -co -i -if -f prompts/chat-with-qwen.txt \
 --in-prefix "<|im_start|>user\n" \
 --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
--ngl
+-ngl 80 -fa
 ```
 
 ## Evaluation
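For context: the ` ```python ` line in the first hunk opens the README's OpenAI-client example, which the diff context cuts off. A minimal sketch of such a call, assuming `llama-server` is listening on its default port 8080 and the `openai` Python package is installed (the model name and prompts here are illustrative, not taken from the README):

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API, by default at http://localhost:8080.
# The API key is not validated by llama-server, but the client requires a non-empty one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="qwen2-72b-instruct",  # illustrative; llama-server serves whatever GGUF it loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
    ],
)
print(completion.choices[0].message.content)
```

Any OpenAI-compatible client works the same way against this endpoint, since the server only needs the base URL pointed at `/v1`.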