File size: 5,947 Bytes

---
license: apache-2.0
language:
- es
tags:
- quantization
- gptq
pipeline_tag: text-generation
library_name: transformers
inference: false
---

# Llama-2-13b-ft-instruct-es-gptq-4bit

[Llama 2 (13B)](https://huggingface.co/meta-llama/Llama-2-13b) fine-tuned on [Clibrain](https://huggingface.co/clibrain)'s Spanish instructions dataset and **optimized** using **GPTQ**.


## Model Details

Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B pre-trained model.


## About GPTQ (from HF Blog)

Quantization methods usually belong to one of two categories: 

1. Post-Training Quantization (PTQ): We quantize a pre-trained model using moderate resources, such as a calibration dataset and a few hours of computation.
2. Quantization-Aware Training (QAT): Quantization is performed before training or further fine-tuning. 

GPTQ falls into the PTQ category, and this is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive.

Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.

The benefits of this scheme are twofold:

- Memory savings close to x4 for int4 quantization, as the dequantization happens close to the compute unit in a fused kernel, and not in the GPU global memory.
- Potential speedups thanks to the time saved on data communication due to the lower bitwidth used for weights.

The GPTQ paper tackles the layer-wise compression problem: 

Given a layer \\(l\\) with weight matrix \\(W_{l}\\) and layer input \\(X_{l}\\), we want to find a quantized version of the weight \\(\hat{W}_{l}\\) to minimize the mean squared error (MSE):


\\({\hat{W}_{l}}^{*} = argmin_{\hat{W_{l}}} \|W_{l}X-\hat{W}_{l}X\|^{2}_{2}\\)

Once this is solved per layer, a solution to the global problem can be obtained by combining the layer-wise solutions. 

In order to solve this layer-wise compression problem, the author uses the Optimal Brain Quantization framework ([Frantar et al 2022](https://arxiv.org/abs/2208.11580)). The OBQ method starts from the observation that the above equation can be written as the sum of the squared errors, over each row of \\(W_{l}\\).


\\( \sum_{i=0}^{d_{row}} \|W_{l[i,:]}X-\hat{W}_{l[i,:]}X\|^{2}_{2} \\)

This means that we can quantize each row independently. This is called per-channel quantization. For each row \\(W_{l[i,:]}\\), OBQ quantizes one weight at a time while always updating all not-yet-quantized weights, in order to compensate for the error incurred by quantizing a single weight. The update on selected weights has a closed-form formula, utilizing Hessian matrices. 

The GPTQ paper improves this framework by introducing a set of optimizations that reduces the complexity of the quantization algorithm while retaining the accuracy of the model.

Compared to OBQ, the quantization step itself is also faster with GPTQ: it takes 2 GPU-hours to quantize a BERT model (336M) with OBQ, whereas with GPTQ, a Bloom model (176B) can be quantized in less than 4 GPU-hours. 

To learn more about the exact algorithm and the different benchmarks on perplexity and speedups, check out the original [paper](https://arxiv.org/pdf/2210.17323.pdf).

## Example of Usage

```sh
pip install transformers accelerate optimum
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/
```

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "clibrain/Llama-2-13b-ft-instruct-es-gptq-4bit"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)

def create_instruction(instruction, input_data=None, context=None):
    sections = {
        "Instrucción": instruction,
        "Entrada": input_data,
        "Contexto": context,
    }

    system_prompt = "A continuación hay una instrucción que describe una tarea, junto con una entrada que proporciona más contexto. Escriba una respuesta que complete adecuadamente la solicitud.\n\n"
    prompt = system_prompt

    for title, content in sections.items():
        if content is not None:
            prompt += f"### {title}:\n{content}\n\n"

    prompt += "### Respuesta:\n"

    return prompt


def generate(
        instruction,
        input=None,
        context=None,
        max_new_tokens=128,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=4,
        **kwargs
):
    
    prompt = create_instruction(instruction, input, context)
    print(prompt.replace("### Respuesta:\n", ""))
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")
    attention_mask = inputs["attention_mask"].to("cuda")
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
            early_stopping=True
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    return output.split("### Respuesta:")[1].lstrip("\n")

instruction = "Dame una lista de lugares a visitar en España."
print(generate(instruction))
```

### Performance Test

After several executions on a *Nvidia T4 with 16GB VRAM*, we got:

| Latency | GPU Mem Required |
----------|---------|
|49.39 ms/token | 7.06 GB |