Commit f5543ba, committed by Lin-K76 (1 parent: 24983fd)

Update README.md

Files changed (1)
  1. README.md +104 -43
README.md CHANGED
@@ -17,13 +17,13 @@ license_link: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/resolve/
  - **Activation quantization:** FP8
  - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- - **Release Date:** 6/29/2024
- - **Version:** 1.0
  - **License(s):** [mit](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/resolve/main/LICENSE)
  - **Model Developers:** Neural Magic
 
- Quantized version of [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct).
- It achieves an average score of 68.99 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 69.13.
 
  ### Model Optimizations
 
@@ -68,32 +68,93 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
 
  ## Creation
 
- This model was created by applying [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py), as presented in the code snippet below.
- Although AutoFP8 was used for this particular model, Neural Magic is transitioning to using [llm-compressor](https://github.com/vllm-project/llm-compressor) which supports several quantization schemes and models not supported by AutoFP8.
 
  ```python
  from datasets import load_dataset
  from transformers import AutoTokenizer
 
- from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
-
- pretrained_model_dir = "microsoft/Phi-3-mini-128k-instruct"
- quantized_model_dir = "Phi-3-mini-128k-instruct-FP8"
-
- tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
- tokenizer.pad_token = tokenizer.eos_token
-
- ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
- examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
- examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
 
- quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
 
- model = AutoFP8ForCausalLM.from_pretrained(
-     pretrained_model_dir, quantize_config=quantize_config
  )
- model.quantize(examples)
- model.save_quantized(quantized_model_dir)
  ```
 
  ## Evaluation
@@ -124,71 +185,71 @@ lm_eval \
  <tr>
  <td>MMLU (5-shot)
  </td>
- <td>68.10
  </td>
- <td>67.93
  </td>
- <td>99.75%
  </td>
  </tr>
  <tr>
  <td>ARC Challenge (25-shot)
  </td>
- <td>63.65
  </td>
- <td>64.24
  </td>
- <td>100.9%
  </td>
  </tr>
  <tr>
  <td>GSM-8K (5-shot, strict-match)
  </td>
- <td>75.59
  </td>
- <td>74.37
  </td>
- <td>98.38%
  </td>
  </tr>
  <tr>
  <td>Hellaswag (10-shot)
  </td>
- <td>79.76
  </td>
- <td>79.79
  </td>
- <td>100.0%
  </td>
  </tr>
  <tr>
  <td>Winogrande (5-shot)
  </td>
- <td>73.72
  </td>
- <td>74.11
  </td>
- <td>100.5%
  </td>
  </tr>
  <tr>
  <td>TruthfulQA (0-shot)
  </td>
- <td>53.97
  </td>
- <td>53.50
  </td>
- <td>99.12%
  </td>
  </tr>
  <tr>
  <td><strong>Average</strong>
  </td>
- <td><strong>69.13</strong>
  </td>
- <td><strong>68.99</strong>
  </td>
- <td><strong>99.80%</strong>
  </td>
  </tr>
  </table>
 
  - **Activation quantization:** FP8
  - **Intended Use Cases:** Intended for commercial and research use in English. Similarly to [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
  - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
+ - **Release Date:** 8/11/2024
+ - **Version:** 1.1
  - **License(s):** [mit](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/resolve/main/LICENSE)
  - **Model Developers:** Neural Magic
 
+ Quantized version of [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct), with the new configuration files.
+ It achieves an average score of 69.42 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 69.69.
 
  ### Model Optimizations
 
 
 
  ## Creation
 
+ This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.
 
  ```python
+ import torch
  from datasets import load_dataset
  from transformers import AutoTokenizer
 
+ from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+ from llmcompressor.transformers.compression.helpers import (
+     calculate_offload_device_map,
+     custom_offload_device_map,
+ )
 
+ recipe = """
+ quant_stage:
+     quant_modifiers:
+         QuantizationModifier:
+             ignore: ["lm_head"]
+             config_groups:
+                 group_0:
+                     weights:
+                         num_bits: 8
+                         type: float
+                         strategy: tensor
+                         dynamic: false
+                         symmetric: true
+                     input_activations:
+                         num_bits: 8
+                         type: float
+                         strategy: tensor
+                         dynamic: false
+                         symmetric: true
+                     targets: ["Linear"]
+ """
+
+ model_stub = "microsoft/Phi-3-mini-128k-instruct"
+ model_name = model_stub.split("/")[-1]
+
+ device_map = calculate_offload_device_map(
+     model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype=torch.float16
+ )
 
+ model = SparseAutoModelForCausalLM.from_pretrained(
+     model_stub, torch_dtype=torch.float16, device_map=device_map
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_stub)
+
+ output_dir = f"./{model_name}-FP8"
+
+ DATASET_ID = "HuggingFaceH4/ultrachat_200k"
+ DATASET_SPLIT = "train_sft"
+ NUM_CALIBRATION_SAMPLES = 512
+ MAX_SEQUENCE_LENGTH = 4096
+
+ ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
+ ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+ def preprocess(example):
+     return {
+         "text": tokenizer.apply_chat_template(
+             example["messages"],
+             tokenize=False,
+         )
+     }
+
+ ds = ds.map(preprocess)
+
+ def tokenize(sample):
+     return tokenizer(
+         sample["text"],
+         padding=False,
+         max_length=MAX_SEQUENCE_LENGTH,
+         truncation=True,
+         add_special_tokens=False,
+     )
+
+ ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+ oneshot(
+     model=model,
+     output_dir=output_dir,
+     dataset=ds,
+     recipe=recipe,
+     max_seq_length=MAX_SEQUENCE_LENGTH,
+     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+     save_compressed=True,
  )
  ```
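
The recipe in the snippet above asks for 8-bit floating-point, per-tensor (`strategy: tensor`), symmetric, static (`dynamic: false`) quantization of `Linear` weights and input activations. As a rough numerical illustration of that scheme, the pure-Python sketch below quantizes a short list of values with a single scale. It is not LLM Compressor's implementation: the E4M3 flavor of FP8 (largest finite value 448), the max-abs choice of scale, and the helper names `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are all assumptions made for this example.

```python
import math

# Assumption: FP8 here means the E4M3 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest value representable in FP8 E4M3 (hypothetical helper)."""
    if value == 0.0:
        return 0.0
    # Saturate to the representable range before rounding.
    clipped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    # frexp returns (m, e) with abs(clipped) = m * 2**e and 0.5 <= m < 1,
    # so the power-of-two exponent of the value is e - 1.
    _, e = math.frexp(abs(clipped))
    # Normal E4M3 numbers have exponents down to -6; smaller magnitudes are subnormal.
    exponent = max(e - 1, -6)
    # Three mantissa bits give 8 evenly spaced steps per power of two.
    step = 2.0 ** (exponent - 3)
    return round(clipped / step) * step


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale shared by every element."""
    max_abs = max(abs(v) for v in values)
    # Assumption: the scale maps the largest magnitude onto the FP8 maximum.
    scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.02, -0.75, 0.31, 1.5]
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f}")
```

With static quantization the scale is fixed ahead of time, from the weights themselves or from calibration data for activations, rather than recomputed per input at inference time.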
 
  ## Evaluation
 
  <tr>
  <td>MMLU (5-shot)
  </td>
+ <td>69.33
  </td>
+ <td>68.90
  </td>
+ <td>99.38%
  </td>
  </tr>
  <tr>
  <td>ARC Challenge (25-shot)
  </td>
+ <td>63.05
  </td>
+ <td>63.05
  </td>
+ <td>100.0%
  </td>
  </tr>
  <tr>
  <td>GSM-8K (5-shot, strict-match)
  </td>
+ <td>76.95
  </td>
+ <td>76.27
  </td>
+ <td>99.12%
  </td>
  </tr>
  <tr>
  <td>Hellaswag (10-shot)
  </td>
+ <td>79.58
  </td>
+ <td>79.36
  </td>
+ <td>99.72%
  </td>
  </tr>
  <tr>
  <td>Winogrande (5-shot)
  </td>
+ <td>74.82
  </td>
+ <td>74.59
  </td>
+ <td>99.69%
  </td>
  </tr>
  <tr>
  <td>TruthfulQA (0-shot)
  </td>
+ <td>54.41
  </td>
+ <td>54.36
  </td>
+ <td>99.91%
  </td>
  </tr>
  <tr>
  <td><strong>Average</strong>
  </td>
+ <td><strong>69.69</strong>
  </td>
+ <td><strong>69.42</strong>
  </td>
+ <td><strong>99.61%</strong>
  </td>
  </tr>
  </table>
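
The third column of the updated table appears to be the quantized score expressed as a percentage of the unquantized score. The short check below recomputes the averages and that ratio from the values in the table. The column order (unquantized first, quantized second) is inferred from the averages quoted in the model description, and the `recovery` helper is a name introduced only for this sketch.

```python
# Scores copied from the updated table, as (unquantized, quantized) pairs.
# Assumption: the first score column is the unquantized baseline.
scores = {
    "MMLU (5-shot)": (69.33, 68.90),
    "ARC Challenge (25-shot)": (63.05, 63.05),
    "GSM-8K (5-shot, strict-match)": (76.95, 76.27),
    "Hellaswag (10-shot)": (79.58, 79.36),
    "Winogrande (5-shot)": (74.82, 74.59),
    "TruthfulQA (0-shot)": (54.41, 54.36),
}


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score (hypothetical helper)."""
    return 100.0 * quantized / baseline


baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quantized_avg = sum(q for _, q in scores.values()) / len(scores)

for name, (b, q) in scores.items():
    print(f"{name}: {recovery(b, q):.2f}%")
print(f"Average: {baseline_avg:.2f} vs {quantized_avg:.2f}, "
      f"recovery {recovery(baseline_avg, quantized_avg):.2f}%")
```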