Adding code changes and the necessary steps to reproduce inference on GPU using the transformers tutorial
README.md
CHANGED
@@ -83,10 +83,32 @@ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/BLOOMChat-176B-v1

Specifically we tested BLOOM inference via command-line in this repository.

Running command:

```
python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
```

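If you want to call BLOOMChat on GPU directly through `transformers` instead of the `inference_server` CLI, a minimal sketch of that path looks like the following. The prompt text, dtype, and generation settings here are placeholders (the CLI run above uses int8 quantization and its own `generate_kwargs`), so adjust them to your setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes transformers==4.27.0 and enough GPU memory for accelerate to shard the model
tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/BLOOMChat-176B-v1")
model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/BLOOMChat-176B-v1",
    device_map="auto",           # let accelerate place layers across the available GPUs
    torch_dtype=torch.bfloat16,  # placeholder dtype; not the int8 path used by the CLI above
)

# Placeholder prompt; adjust the formatting to match the BLOOMChat prompt template
inputs = tokenizer("<human>: Give me a fun fact about GPUs.\n<bot>:", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```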
NOTE: Things that we had to modify in order for BLOOMChat to work:
- Install transformers version 4.27.0
  - `pip install transformers==4.27.0`
- Change the model name from `bigscience/bloom` to `sambanovasystems/BLOOMChat-176B-v1`
- Modify `inference_server/models/hf_accelerate.py`
  - This is because we tested this repo on 4 80GB A100 GPUs and would otherwise run into memory issues; the modified snippet is shown below

Modifications to `inference_server/models/hf_accelerate.py`:

```python
from accelerate.utils.modeling import get_max_memory
...
class HFAccelerateModel(Model):
    def __init__(self, args: Namespace) -> None:
        ...
        original_max_memory_dict = get_max_memory()

        # Cap each device at 85% of its detected maximum to leave memory headroom
        reduce_max_memory_dict = {
            device_key: int(original_max_memory_dict[device_key] * 0.85)
            for device_key in original_max_memory_dict
        }

        kwargs["max_memory"] = reduce_max_memory_dict
```
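For context, here is a standalone sketch of how a `max_memory` mapping like the one above gets consumed when loading a model. The 0.85 headroom factor matches the snippet above, while the direct `from_pretrained` call is only illustrative (the repo itself passes `kwargs` through its own hf_accelerate loading path):

```python
from accelerate.utils.modeling import get_max_memory
from transformers import AutoModelForCausalLM

# get_max_memory() reports per-device limits, e.g. {0: <bytes>, 1: <bytes>, ..., "cpu": <bytes>}
max_memory = {device: int(limit * 0.85) for device, limit in get_max_memory().items()}

model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/BLOOMChat-176B-v1",
    device_map="auto",
    max_memory=max_memory,  # accelerate plans layer placement within these capped limits
)
```

Capping each device at 85% leaves room for activations and CUDA overhead, which is the memory issue the modification above works around on the 4x 80GB A100 setup.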