jayr014 committed
Commit 05e0307
1 Parent(s): d4706b6

Add code changes and the necessary steps to reproduce inference on GPU using the transformers tutorial

Files changed (1): README.md +22 −0
README.md CHANGED
@@ -83,10 +83,32 @@ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/BLOOMChat-176B-v1
 
 Specifically we tested BLOOM inference via command-line in this repository.
 
+Running command:
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+```
+
 NOTE: Things that we had to modify in order for BLOOMChat to work:
 - Install transformers version 4.27.0
   - `pip install transformers==4.27.0`
 - Change the model name from `bigscience/bloom` to `sambanovasystems/BLOOMChat-176B-v1`
+- Modify `inference_server/models/hf_accelerate.py`
+  - For our testing of this repo we used 4 x 80GB A100 GPUs and would otherwise run into memory issues
+
+Modifications for `inference_server/models/hf_accelerate.py`:
+
+```python
+from accelerate.utils.modeling import get_max_memory
+...
+class HFAccelerateModel(Model):
+    def __init__(self, args: Namespace) -> None:
+        ...
+        original_max_memory_dict = get_max_memory()
+
+        reduce_max_memory_dict = {device_key: int(original_max_memory_dict[device_key] * 0.85) for device_key in original_max_memory_dict}
+
+        kwargs["max_memory"] = reduce_max_memory_dict
+```
 
 
 
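A note on the `generate_kwargs` in the running command above: with `"do_sample": false`, decoding is greedy, so `temperature` and `top_p` have no effect and only the repetition penalty and token budget matter. For readers who want to sanity-check the model outside `inference_server`, here is a minimal plain-`transformers` sketch of a roughly equivalent call. It is an illustration, not code from this commit, and it assumes transformers 4.27.0 with `accelerate` and `bitsandbytes` installed; the prompt string is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/BLOOMChat-176B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # shard the 176B parameters across all visible GPUs
    load_in_8bit=True,   # int8 weights, mirroring --dtype int8 above
)

# Placeholder prompt; inputs go to the first GPU, and accelerate moves
# activations between devices as needed.
inputs = tokenizer("What is machine learning?", return_tensors="pt").to("cuda:0")

outputs = model.generate(
    **inputs,
    do_sample=False,         # greedy decoding, as in the CLI command
    repetition_penalty=1.2,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```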
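The `hf_accelerate.py` change above caps each device's memory budget at 85% of what `get_max_memory()` reports, leaving headroom for activations and CUDA overhead during generation. The same idea can be applied when loading the model directly, since `from_pretrained` accepts a `max_memory` dict; the following is a sketch under that assumption, not part of the commit.

```python
from accelerate.utils.modeling import get_max_memory
from transformers import AutoModelForCausalLM

# get_max_memory() returns {device: bytes_available}, with one entry per
# visible GPU (integer keys) plus "cpu".
original_max_memory = get_max_memory()

# Cap every device at 85% of its reported capacity, matching the factor
# used in the hf_accelerate.py modification above.
capped_max_memory = {device: int(limit * 0.85) for device, limit in original_max_memory.items()}

model = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/BLOOMChat-176B-v1",
    device_map="auto",
    load_in_8bit=True,            # as with --dtype int8 in the running command
    max_memory=capped_max_memory, # capped per-device budgets
)
```

Without the cap, accelerate plans weight placement against each device's full capacity, and the extra memory needed at generation time (KV cache, activations) can push a device over its limit; that is consistent with the out-of-memory behavior the note describes on 4 x 80GB A100s.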