Update README.md
README.md (changed)
Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GPUs.

```
docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
```

5. Download the checkpoint
```
git lfs install
git clone https://huggingface.co/nvidia/Llama2-70B-SteerLM-Chat
```

6. Convert the checkpoint into the .nemo format
```
cd Llama2-70B-SteerLM-Chat/Llama2-70B-SteerLM-Chat
tar -cvf Llama2-70B-SteerLM-Chat.nemo .
...
rm -r Llama2-70B-SteerLM-Chat
```

7. Run the Docker container
```
docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/Llama2-70B-SteerLM-Chat.nemo:/opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
```

8. Within the container, start the server in the background. This step both converts the nemo checkpoint to TRT-LLM and deploys it with TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)

```
python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
```

9. Once the server is ready, which takes 20-45 minutes depending on your machine (i.e. when you see the message below), you are ready to launch your client code

```
Started HTTPService at 0.0.0.0:8000
```
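
If you would rather not watch the container logs, you can also poll the server from the host. This is only a minimal sketch, assuming the deployment exposes Triton's standard KServe health endpoint (`/v2/health/ready`) on the port mapped above:

```
# Minimal readiness poll (sketch). Assumes the deployment exposes Triton's
# standard KServe health endpoint, GET /v2/health/ready, on the mapped port 8000.
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8000/v2/health/ready", timeout_s=3600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server still starting up
        time.sleep(30)
    return False

if wait_until_ready():
    print("Server is ready")
```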

Only the tail of the client snippet is shown in this hunk; it keeps everything before the first turn delimiter in the returned text and prints the result:

```
output = output[0][0].split("\n<extra_id_1>")[0]
print(output)
```
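
The body of the client snippet is not visible in this hunk. As a rough sketch of what such a client might look like, using the NemoQuery helper shipped with the NeMo FW inference container (`nemo.deploy`) and the single-turn template from the next step; argument names such as `max_output_token` can differ between container versions, so treat this as illustrative rather than exact:

```
# Illustrative client (sketch). Uses nemo.deploy.NemoQuery from the inference
# container; check the NeMo FW deployment guide if argument names differ in
# your container version.
from nemo.deploy import NemoQuery

# Single-turn SteerLM prompt layout (see "Default template for Single Turn" below);
# the User/Assistant header lines follow standard SteerLM formatting.
PROMPT_TEMPLATE = """<extra_id_0>System
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<extra_id_1>User
{prompt}
<extra_id_1>Assistant
<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4
"""

question = "Write a short poem about GPUs"
prompt = PROMPT_TEMPLATE.format(prompt=question)

nq = NemoQuery(url="localhost:8000", model_name="Llama2-70B-SteerLM-Chat")
output = nq.query_llm(prompts=[prompt], max_output_token=1024, top_k=1, top_p=0.0, temperature=1.0)

# Keep only the first assistant turn from the returned text.
output = output[0][0].split("\n<extra_id_1>")[0]
print(output)
```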

10. If you would like to support multi-turn conversations or adjust attribute values at inference time, here is some guidance (a sketch of assembling such prompts in code follows the templates below):

Default template for Single Turn
```
<extra_id_0>System
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
...
<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4
```

Default template for Multi-Turn
```
<extra_id_0>System
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
...
```
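
To adjust attribute values (for example, lowering verbosity for terser answers) or to carry earlier turns into the prompt, you can assemble the template programmatically. A minimal sketch: the attribute names and the 0-4 scale come from the templates above, while the exact layout of turns beyond the lines shown in this hunk is an assumption.

```
# Sketch: assemble a SteerLM prompt and adjust attribute values at inference time.
# Attribute names and the 0-4 scale are taken from the templates above; the exact
# multi-turn layout past the lines shown in this hunk is an assumption.
SYSTEM = (
    "<extra_id_0>System\n"
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
)

DEFAULT_ATTRIBUTES = {
    "quality": 4, "toxicity": 0, "humor": 0, "creativity": 0,
    "helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 4, "verbosity": 4,
}

def attribute_string(**overrides):
    """Build the <extra_id_2> label line, e.g. verbosity=1 for shorter answers."""
    attrs = {**DEFAULT_ATTRIBUTES, **overrides}
    return "<extra_id_2>" + ",".join(f"{k}:{v}" for k, v in attrs.items())

def build_prompt(turns, **overrides):
    """turns: list of (user_message, assistant_reply) pairs; the last reply may be None."""
    parts = [SYSTEM]
    for user_msg, assistant_msg in turns:
        parts.append(f"<extra_id_1>User\n{user_msg}\n<extra_id_1>Assistant\n")
        parts.append(attribute_string(**overrides) + "\n")
        if assistant_msg is not None:
            parts.append(assistant_msg + "\n")
    return "".join(parts)

# Single turn, with verbosity dialled down:
print(build_prompt([("What is SteerLM?", None)], verbosity=1))

# Multi turn: include the earlier exchange, then leave the last reply empty
# so the model generates it.
print(build_prompt([
    ("What is SteerLM?", "SteerLM is an attribute-conditioned fine-tuning technique."),
    ("How do I change the verbosity?", None),
]))
```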