Update README.md
README.md

## Model Details

Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pre-trained model.
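As a minimal sketch of how a checkpoint like this is typically loaded and queried with the `transformers` library (the repository id below is a placeholder rather than this repo's actual path, and `device_map="auto"` assumes `accelerate` is installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; substitute the actual model repo you want to load.
model_id = "your-namespace/Llama-2-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the weights on the available GPU(s)/CPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Quantization reduces memory usage because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```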

## About GPTQ (from HF Blog)

Quantization methods usually belong to one of two categories:

1. Post-Training Quantization (PTQ): We quantize a pre-trained model using moderate resources, such as a calibration dataset and a few hours of computation.
2. Quantization-Aware Training (QAT): Quantization is performed before training or further fine-tuning.

GPTQ falls into the PTQ category, and this is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive.

Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.
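As a rough sketch of what that looks like with the `transformers` GPTQ integration described in the HF blog (assuming `optimum` and `auto-gptq` are installed): `GPTQConfig` requests 4-bit weights and a calibration dataset, and the resulting model keeps its compute in float16. The small model id below is only a placeholder to keep the example cheap to run.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Small placeholder model; in practice you would point this at a Llama 2 checkpoint.
model_id = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Post-training quantization: 4-bit weights, calibrated on samples from the "c4" dataset.
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are stored as int4; activations and the actual compute stay in float16,
# with weights dequantized on the fly during inference.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)

# The quantized model can be saved and later reloaded like any other checkpoint.
model.save_pretrained("opt-125m-gptq")
```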