Lin-K76 committed
Commit e8570ed
1 Parent(s): 9e78a53

Update README.md

Files changed (1)
  1. README.md +106 -62
README.md CHANGED
@@ -17,13 +17,13 @@ license_link: https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/resolv
  - **Activation quantization:** FP8
  - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- - **Release Date:** 6/29/2024
- - **Version:** 1.0
  - **License(s):** [mit](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/resolve/main/LICENSE)
  - **Model Developers:** Neural Magic

- Quantized version of [Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct).
- It achieves an average score of 73.20 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.47.

  ### Model Optimizations
 
@@ -68,50 +68,94 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
  ## Creation

- This model was created by applying [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py), as presented in the code snippet below.
- Although AutoFP8 was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor), which supports several quantization schemes and models not supported by AutoFP8.

  ```python
  from datasets import load_dataset
  from transformers import AutoTokenizer
- import numpy as np
- import torch
-
- from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
-
- MODEL_DIR = "microsoft/Phi-3-medium-128k-instruct"
- final_model_dir = MODEL_DIR.split("/")[-1]
-
- CONTEXT_LENGTH = 4096
- NUM_SAMPLES = 512
- NUM_REPEATS = 10
-
- pretrained_model_dir = MODEL_DIR
- tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=CONTEXT_LENGTH)
- tokenizer.pad_token = tokenizer.eos_token
-
- tokenizer_num_tokens = len(list(tokenizer.get_vocab().values()))
- total_token_samples = NUM_REPEATS * tokenizer_num_tokens
- num_random_samp = -(-total_token_samples // CONTEXT_LENGTH)

- input_ids = np.tile(np.arange(tokenizer_num_tokens), NUM_REPEATS + 1)[:num_random_samp * CONTEXT_LENGTH]
- np.random.shuffle(input_ids)
- input_ids = input_ids.reshape(num_random_samp, CONTEXT_LENGTH)
- input_ids = torch.tensor(input_ids, dtype=torch.int64).to("cuda")
-
- quantize_config = BaseQuantizeConfig(
-     quant_method="fp8",
-     activation_scheme="static",
  )

- examples = input_ids
-
- model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
-
- model.quantize(examples)

- quantized_model_dir = f"{final_model_dir}-FP8"
- model.save_quantized(quantized_model_dir)
  ```

  ## Evaluation
@@ -120,7 +164,7 @@ The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic/Phi-3-medium-128k-instruct-FP8",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096 \
  --tasks openllm \
  --batch_size auto
  ```
@@ -142,71 +186,71 @@ lm_eval \
  <tr>
  <td>MMLU (5-shot)
  </td>
- <td>75.73
  </td>
- <td>75.58
  </td>
- <td>99.80%
  </td>
  </tr>
  <tr>
  <td>ARC Challenge (25-shot)
  </td>
- <td>67.66
  </td>
- <td>66.89
  </td>
- <td>98.86%
  </td>
  </tr>
  <tr>
  <td>GSM-8K (5-shot, strict-match)
  </td>
- <td>84.00
  </td>
- <td>82.71
  </td>
- <td>98.46%
  </td>
  </tr>
  <tr>
  <td>Hellaswag (10-shot)
  </td>
- <td>84.37
  </td>
- <td>84.14
  </td>
- <td>99.72%
  </td>
  </tr>
  <tr>
  <td>Winogrande (5-shot)
  </td>
- <td>75.53
  </td>
- <td>74.35
  </td>
- <td>98.43%
  </td>
  </tr>
  <tr>
  <td>TruthfulQA (0-shot)
  </td>
- <td>53.53
  </td>
- <td>55.52
  </td>
- <td>103.7%
  </td>
  </tr>
  <tr>
  <td><strong>Average</strong>
  </td>
- <td><strong>73.47</strong>
  </td>
- <td><strong>73.20</strong>
  </td>
- <td><strong>99.63%</strong>
  </td>
  </tr>
  </table>
 
  - **Activation quantization:** FP8
  - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
+ - **Release Date:** 8/12/2024
+ - **Version:** 1.1
  - **License(s):** [mit](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/resolve/main/LICENSE)
  - **Model Developers:** Neural Magic

+ Quantized version of [Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct), with the new configuration files.
+ It achieves an average score of 73.65 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 73.95.

  ### Model Optimizations
 
  ## Creation

+ This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.
+ Importantly, the "rope_scaling" type in config.json was manually changed from "longrope" to "su" following quantization (a sketch of this edit follows the code block).

  ```python
+ import torch
  from datasets import load_dataset
  from transformers import AutoTokenizer

+ from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+ from llmcompressor.transformers.compression.helpers import (
+     calculate_offload_device_map,
+     custom_offload_device_map,
  )

+ recipe = """
+ quant_stage:
+     quant_modifiers:
+         QuantizationModifier:
+             ignore: ["lm_head"]
+             config_groups:
+                 group_0:
+                     weights:
+                         num_bits: 8
+                         type: float
+                         strategy: tensor
+                         dynamic: false
+                         symmetric: true
+                     input_activations:
+                         num_bits: 8
+                         type: float
+                         strategy: tensor
+                         dynamic: false
+                         symmetric: true
+                     targets: ["Linear"]
+ """
+
+ model_stub = "microsoft/Phi-3-medium-128k-instruct"
+ model_name = model_stub.split("/")[-1]
+
+ device_map = calculate_offload_device_map(
+     model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype=torch.float16
+ )

+ model = SparseAutoModelForCausalLM.from_pretrained(
+     model_stub, torch_dtype=torch.float16, device_map=device_map
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_stub)
+
+ output_dir = f"./{model_name}-FP8"
+
+ DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+ DATASET_SPLIT = "train_sft"
+ NUM_CALIBRATION_SAMPLES = 512
+ MAX_SEQUENCE_LENGTH = 4096
+
+ ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
+ ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+ def preprocess(example):
+     return {
+         "text": tokenizer.apply_chat_template(
+             example["messages"],
+             tokenize=False,
+         )
+     }
+
+ ds = ds.map(preprocess)
+
+ def tokenize(sample):
+     return tokenizer(
+         sample["text"],
+         padding=False,
+         max_length=MAX_SEQUENCE_LENGTH,
+         truncation=True,
+         add_special_tokens=False,
+     )
+
+ ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+ oneshot(
+     model=model,
+     output_dir=output_dir,
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=MAX_SEQUENCE_LENGTH,
+     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+     save_compressed=True,
+ )
  ```
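
The rope_scaling change mentioned above is a one-line edit to the saved config.json. As a minimal sketch (not part of the original recipe; the path assumes the `output_dir` used above, and the edit can equally be made by hand):

```python
import json
from pathlib import Path

# Hypothetical post-processing step: switch the rope_scaling type
# from "longrope" to "su" in the quantized model's config.json.
config_path = Path("./Phi-3-medium-128k-instruct-FP8/config.json")
config = json.loads(config_path.read_text())

if config.get("rope_scaling", {}).get("type") == "longrope":
    config["rope_scaling"]["type"] = "su"
    config_path.write_text(json.dumps(config, indent=2) + "\n")
```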

  ## Evaluation

  ```
  lm_eval \
  --model vllm \
+ --model_args pretrained="neuralmagic/Phi-3-medium-128k-instruct-FP8",dtype=auto,gpu_memory_utilization=0.7,add_bos_token=True,max_model_len=4096 \
  --tasks openllm \
  --batch_size auto
  ```
 
  <tr>
  <td>MMLU (5-shot)
  </td>
+ <td>76.53
  </td>
+ <td>76.66
  </td>
+ <td>100.1%
  </td>
  </tr>
  <tr>
  <td>ARC Challenge (25-shot)
  </td>
+ <td>68.17
  </td>
+ <td>67.06
  </td>
+ <td>98.37%
  </td>
  </tr>
  <tr>
  <td>GSM-8K (5-shot, strict-match)
  </td>
+ <td>84.46
  </td>
+ <td>84.31
  </td>
+ <td>99.82%
  </td>
  </tr>
  <tr>
  <td>Hellaswag (10-shot)
  </td>
+ <td>84.77
  </td>
+ <td>84.63
  </td>
+ <td>99.83%
  </td>
  </tr>
  <tr>
  <td>Winogrande (5-shot)
  </td>
+ <td>75.22
  </td>
+ <td>74.51
  </td>
+ <td>99.06%
  </td>
  </tr>
  <tr>
  <td>TruthfulQA (0-shot)
  </td>
+ <td>54.52
  </td>
+ <td>54.71
  </td>
+ <td>100.35%
  </td>
  </tr>
  <tr>
  <td><strong>Average</strong>
  </td>
+ <td><strong>73.95</strong>
  </td>
+ <td><strong>73.65</strong>
  </td>
+ <td><strong>99.60%</strong>
  </td>
  </tr>
  </table>
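
The recovery column is the quantized score expressed as a percentage of the unquantized score. As a minimal check (illustrative arithmetic only), for the TruthfulQA row above:

```python
# Recovery = quantized score / unquantized score, as a percentage.
print(f"{54.71 / 54.52 * 100:.2f}%")  # prints 100.35%
```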