mrm8488 committed on
Commit 7f75f62
1 Parent(s): f5cf7bb

Update README.md

Files changed (1): README.md +3 -3
README.md CHANGED
@@ -17,17 +17,17 @@ inference: false
 
 ## Model Details
 
-Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pre-trained model. Links to other models can be found in the index at the bottom.
+Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pre-trained model.
 
 
 ## About GPTQ (from HF Blog)
 
-Quantization methods usually belong into one of two categories:
+Quantization methods usually belong to one of two categories:
 
 1. Post-Training Quantization (PTQ): We quantize a pre-trained model using moderate resources, such as a calibration dataset and a few hours of computation.
 2. Quantization-Aware Training (QAT): Quantization is performed before training or further fine-tuning.
 
-GPTQ falls into the PTQ category and this is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive.
+GPTQ falls into the PTQ category, and this is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive.
 
 Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.
 
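As a rough sketch of the PTQ workflow described in the README above, the following shows GPTQ quantization through 🤗 Transformers' `GPTQConfig` (which delegates to `optimum` and `auto-gptq`). The base model id and the `"c4"` calibration dataset are illustrative assumptions, not necessarily the recipe used for this checkpoint.

```python
# Minimal GPTQ post-training quantization sketch (assumes `optimum` and
# `auto-gptq` are installed alongside `transformers`).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative base fp16 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# PTQ: calibrate 4-bit weights against a small dataset, no retraining involved.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# The quantized weights can then be saved (or pushed to the Hub) and later
# reloaded directly with AutoModelForCausalLM.from_pretrained.
model.save_pretrained("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")
```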
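The mixed int4/fp16 scheme amounts to storing weights as 4-bit integers with a per-group scale and zero point, then dequantizing them to float16 just before each matrix multiply. A toy NumPy sketch of that step follows; the shapes and names are hypothetical and this is not the real fused kernel.

```python
# Toy illustration of on-the-fly int4 -> fp16 dequantization.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 32 outputs, 128 inputs, one quantization group per row.
w_int4 = rng.integers(0, 16, size=(32, 128), dtype=np.uint8)  # values in [0, 15]
scale = (rng.random((32, 1)) * 0.01).astype(np.float16)       # per-row scale
zero = np.full((32, 1), 8, dtype=np.uint8)                    # per-row zero point

def dequantize(w, scale, zero):
    # w_fp16 = (w_int4 - zero_point) * scale
    return (w.astype(np.float16) - zero.astype(np.float16)) * scale

x = rng.standard_normal((1, 128)).astype(np.float16)  # activations stay in fp16
y = x @ dequantize(w_int4, scale, zero).T              # actual compute in fp16
print(y.dtype, y.shape)                                 # float16 (1, 32)
```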