TheBloke committed on
Commit 757de72
1 Parent(s): 4db56ca

Update README.md

Files changed (1)
  1. README.md +28 -90
README.md CHANGED
@@ -42,7 +42,15 @@ quantized_by: TheBloke
<!-- description start -->
# Description

- This repo contains GPTQ model files for [Mistral AI's Mixtral 8X7B v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1).
+ This repo contains **EXPERIMENTAL** GPTQ model files for [Mistral AI's Mixtral 8X7B v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1).
+
+ ## Requires AutoGPTQ PR
+
+ These files were made with, and will currently only work with, this AutoGPTQ PR: https://github.com/LaaZa/AutoGPTQ/tree/Mixtral
+
+ To test, please build AutoGPTQ from source using that PR.
+
+ Updates for Transformers support are expected soon.

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
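
To test with the PR before Transformers support lands, one possible way to build that branch from source is sketched below (illustrative only; it assumes git, a working CUDA-enabled PyTorch environment, and that the branch is named `Mixtral` as in the URL above):

```shell
# Build AutoGPTQ from the LaaZa Mixtral PR branch (adjust to your environment)
git clone https://github.com/LaaZa/AutoGPTQ
cd AutoGPTQ
git checkout Mixtral
pip3 install .
```
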
@@ -66,22 +74,6 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
<!-- prompt-template end -->


-
- <!-- README_GPTQ.md-compatible clients start -->
- ## Known compatible clients / servers
-
- GPTQ models are currently supported on Linux (NVidia/AMD) and Windows (NVidia only). macOS users: please use GGUF models.
-
- These GPTQ models are known to work in the following inference servers/webuis.
-
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
- [KoboldAI United](https://github.com/henk717/koboldai)
- [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui)
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
-
- This may not be a complete list; if you know of others, please let me know!
- <!-- README_GPTQ.md-compatible clients end -->
-
<!-- README_GPTQ.md-provided-files start -->
## Provided files, and GPTQ parameters

@@ -186,6 +178,8 @@ Note that using Git with HF repos is strongly discouraged. It will be much slowe
<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

+ **NOTE**: This will only work with the AutoGPTQ loader, and only if you build AutoGPTQ from source using https://github.com/LaaZa/AutoGPTQ/tree/Mixtral
+
Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
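
If you are on a manual text-generation-webui install, a rough sketch of swapping the PR build into the webui's Python environment is shown below (illustrative; it assumes that environment is active, and note that the one-click installers manage their own environment, so adapt accordingly):

```shell
# Remove any pre-installed AutoGPTQ wheel, then install the Mixtral PR branch from source
pip3 uninstall -y auto-gptq
pip3 install git+https://github.com/LaaZa/AutoGPTQ@Mixtral
```
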
@@ -209,44 +203,6 @@ It is strongly recommended to use the text-generation-webui one-click-installers

<!-- README_GPTQ.md-text-generation-webui end -->

- <!-- README_GPTQ.md-use-from-tgi start -->
- ## Serving this model from Text Generation Inference (TGI)
-
- It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
-
- Example Docker parameters:
-
- ```shell
- --model-id TheBloke/Mixtral-8x7B-v0.1-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
- ```
-
- Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):
-
- ```shell
- pip3 install huggingface-hub
- ```
-
- ```python
- from huggingface_hub import InferenceClient
-
- endpoint_url = "https://your-endpoint-url-here"
-
- prompt = "Tell me about AI"
- prompt_template=f'''{prompt}
- '''
-
- client = InferenceClient(endpoint_url)
- response = client.text_generation(prompt,
-     max_new_tokens=128,
-     do_sample=True,
-     temperature=0.7,
-     top_p=0.95,
-     top_k=40,
-     repetition_penalty=1.1)
-
- print(f"Model output: {response}")
- ```
- <!-- README_GPTQ.md-use-from-tgi end -->
<!-- README_GPTQ.md-use-from-python start -->
## Python code example: inference from this GPTQ model

@@ -275,21 +231,28 @@ pip3 install .
### Example Python code

```python
- from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+ from transformers import AutoTokenizer
+ from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
- # To use a different branch, change revision
- # For example: revision="gptq-4bit-128g-actorder_True"
- model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
-     device_map="auto",
-     trust_remote_code=False,
-     revision="main")

- tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+ # To use a different branch, change revision
+ # For example: revision="gptq-4bit-32g-actorder_True"
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         model_basename="model",
+         use_safetensors=True,
+         trust_remote_code=False,
+         device="cuda:0",
+         use_triton=False,
+         disable_exllama=True,
+         disable_exllamav2=True,
+         quantize_config=None)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True, trust_remote_code=False)

prompt = "Tell me about AI"
- prompt_template=f'''{prompt}
- '''
+ prompt_template=f'''{prompt}'''

print("\n\n*** Generate:")

@@ -297,34 +260,9 @@ input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

- # Inference can also be done using transformers' pipeline
-
- print("*** Pipeline:")
- pipe = pipeline(
-     "text-generation",
-     model=model,
-     tokenizer=tokenizer,
-     max_new_tokens=512,
-     do_sample=True,
-     temperature=0.7,
-     top_p=0.95,
-     top_k=40,
-     repetition_penalty=1.1
- )
-
- print(pipe(prompt_template)[0]['generated_text'])
```
<!-- README_GPTQ.md-use-from-python end -->

- <!-- README_GPTQ.md-compatibility start -->
- ## Compatibility
-
- The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly.
-
- [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama and Mistral models in 4-bit. Please see the Provided Files table above for per-file compatibility.
-
- For a list of clients/servers, please see "Known compatible clients / servers", above.
- <!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->
<!-- 200823 -->
 