Update README.md
README.md CHANGED
@@ -108,13 +108,15 @@ Pre-requisite: You would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 ```
 
 7. Run Docker container
+(In addition, to use the Llama3 tokenizer, you need to `export HF_HOME=<YOUR_HF_HOME_CONTAINING_TOKEN_WITH_LLAMA3_70B_ACCESS>` before running the container)
 ```
-docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/Llama3-70B-PPO-Chat.nemo:/opt/checkpoints/Llama3-70B-PPO-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
+docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/Llama3-70B-PPO-Chat.nemo:/opt/checkpoints/Llama3-70B-PPO-Chat.nemo -v ${HF_HOME}:/hf_home -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
 ```
+
 8. Within the container, start the server in the background. This step both converts the NeMo checkpoint to TRT-LLM and then deploys it with TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)
 
 ```
-python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama3-70B-PPO-Chat.nemo --model_type="llama" --triton_model_name Llama3-70B-PPO-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
+HF_HOME=/hf_home python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama3-70B-PPO-Chat.nemo --model_type="llama" --triton_model_name Llama3-70B-PPO-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
 ```
 
 9. Once the server is ready (i.e. when you see the messages below), you are ready to launch your client code
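To sanity-check the HF_HOME setup from step 7, here is a minimal sketch, assuming the `huggingface_hub` package is available in the container and that `/hf_home` is the directory mounted via `-v ${HF_HOME}:/hf_home`:

```python
# Sanity check for the HF_HOME mount described in step 7.
# Assumption: huggingface_hub is installed and /hf_home is the mounted directory.
import os

os.environ.setdefault("HF_HOME", "/hf_home")  # must be set before importing huggingface_hub

from huggingface_hub import whoami

# whoami() reads the token stored under $HF_HOME (typically $HF_HOME/token)
# and raises if no valid token is found.
print(whoami()["name"])  # should print the account that was granted Llama3-70B access
```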
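For step 9, a minimal client sketch along the lines of the NeMo FW deployment guide: it assumes the `NemoQuery` helper from `nemo.deploy` is available (it ships with the inference container), that the server from step 8 is listening on localhost:8000, and that parameter names such as `max_output_token` may differ slightly between NeMo releases. Before sending queries you can also poll Triton's standard readiness endpoint (`curl localhost:8000/v2/health/ready`), assuming the default HTTP endpoints are exposed.

```python
# Minimal client sketch: send one prompt to the Triton server started in step 8.
# Assumptions: run inside the inference container (or any environment with nemo.deploy
# installed) and the server is reachable on localhost:8000.
from nemo.deploy import NemoQuery

nq = NemoQuery(url="localhost:8000", model_name="Llama3-70B-PPO-Chat")

output = nq.query_llm(
    prompts=["Write a short poem about reinforcement learning."],
    max_output_token=256,  # may be named max_output_len in newer NeMo releases
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)
```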