Update README.md
README.md

## Model Details

Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pre-trained model.
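As a minimal sketch of how a checkpoint like this is typically loaded and queried with the `transformers` library (the repository id below is a placeholder rather than this repo's actual path, and `device_map="auto"` assumes `accelerate` is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; substitute the actual model repo you want to load.
model_id = "your-namespace/Llama-2-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the weights on the available GPU(s)/CPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Quantization reduces memory usage because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```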

## About GPTQ (from HF Blog)

Quantization methods usually belong to one of two categories:

1. Post-Training Quantization (PTQ): We quantize a pre-trained model using moderate resources, such as a calibration dataset and a few hours of computation.
2. Quantization-Aware Training (QAT): Quantization is performed before training or further fine-tuning.

GPTQ falls into the PTQ category, and this is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive.

Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.
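As a rough sketch of what that looks like with the `transformers` GPTQ integration described in the HF blog (assuming `optimum` and `auto-gptq` are installed): `GPTQConfig` requests 4-bit weights and a calibration dataset, and the resulting model keeps its compute in float16. The small model id below is only a placeholder to keep the example cheap to run.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Small placeholder model; in practice you would point this at a Llama 2 checkpoint.
model_id = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Post-training quantization: 4-bit weights, calibrated on samples from the "c4" dataset.
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are stored as int4; activations and the actual compute stay in float16,
# with weights dequantized on the fly during inference.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

# The quantized model can be saved and later reloaded like any other checkpoint.
model.save_pretrained("opt-125m-gptq")
```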