|
--- |
|
license: apache-2.0 |
|
language: |
|
- es |
|
tags: |
|
- quantization |
|
- gptq |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
inference: false |
|
--- |
|
|
|
# Llama-2-13b-ft-instruct-es-gptq-4bit |
|
|
|
[Llama 2 (13B)](https://huggingface.co/meta-llama/Llama-2-13b) fine-tuned on [Clibrain](https://huggingface.co/clibrain)'s Spanish instructions dataset and **optimized** using **GPTQ**. |
|
|
|
|
|
## Model Details |
|
|
|
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This repository contains the 13B model fine-tuned on Spanish instructions and quantized to 4-bit with GPTQ.
|
|
|
|
|
## About GPTQ (from the HF Blog)
|
|
|
Quantization methods usually belong to one of two categories: |
|
|
|
1. Post-Training Quantization (PTQ): We quantize a pre-trained model using moderate resources, such as a calibration dataset and a few hours of computation. |
|
2. Quantization-Aware Training (QAT): Quantization is performed before training or further fine-tuning. |
|
|
|
GPTQ falls into the PTQ category, and this is particularly interesting for massive models, for which full model training or even fine-tuning can be very expensive. |
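
As a rough illustration of the PTQ workflow, the sketch below shows how a checkpoint like this one could be produced with the GPTQ integration in 🤗 Transformers. The base model id and calibration dataset are assumptions for illustration, not the actual recipe used for this repository.

```py
# Hedged sketch of post-training GPTQ quantization via transformers' GPTQConfig.
# The base model id and calibration dataset below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "clibrain/Llama-2-13b-ft-instruct-es"  # assumed unquantized fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)

gptq_config = GPTQConfig(
    bits=4,          # int4 weights
    dataset="c4",    # small calibration set; PTQ needs only a few samples and some GPU hours
    tokenizer=tokenizer,
)

# Quantization happens while loading: weights are replaced by packed int4 tensors,
# while activations (and the actual compute) stay in float16.
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=gptq_config,
    device_map="auto",
)
model.save_pretrained("Llama-2-13b-ft-instruct-es-gptq-4bit")
```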
|
|
|
Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16. |
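
To make the scheme concrete, here is a minimal, self-contained sketch of round-to-nearest int4 quantization with one scale and zero-point per row, followed by on-the-fly dequantization for an fp16 matmul. It is a toy stand-in for illustration only, not GPTQ's fused kernel.

```py
import torch

def quantize_int4_per_row(w: torch.Tensor):
    """Toy round-to-nearest int4 quantization: one (scale, zero-point) pair per row."""
    qmin, qmax = 0, 15                                    # unsigned 4-bit range
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (qmax - qmin)
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero), qmin, qmax).to(torch.uint8)
    return q, scale, zero

def dequantize(q, scale, zero, dtype):
    # On-the-fly dequantization: int4 codes back to floating point right before the matmul.
    return ((q.to(torch.float32) - zero) * scale).to(dtype)

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 matmul is the GPU path

w = torch.randn(4096, 4096, device=device)                # a layer's weight matrix
x = torch.randn(8, 4096, device=device, dtype=dtype)      # activations stay in fp16

q, scale, zero = quantize_int4_per_row(w)
w_hat = dequantize(q, scale, zero, dtype)
y = x @ w_hat.T                                           # actual compute runs in fp16
```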
|
|
|
The benefits of this scheme are twofold: |
|
|
|
- Memory savings close to 4x for int4 quantization, since dequantization happens close to the compute unit in a fused kernel rather than in GPU global memory (see the rough calculation after this list).
|
- Potential speedups thanks to the time saved on data communication due to the lower bitwidth used for weights. |
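
For a rough sense of the ~4x figure, here is a back-of-the-envelope calculation for the weights of a 13B-parameter model (ignoring quantization scales, zero-points and activations, so the real footprint is somewhat higher):

```py
params = 13e9                        # ~13 billion weights
fp16_gb = params * 2 / 1024**3       # 2 bytes per weight   -> ~24.2 GB
int4_gb = params * 0.5 / 1024**3     # 0.5 bytes per weight -> ~6.1 GB
print(f"fp16: {fp16_gb:.1f} GB, int4: {int4_gb:.1f} GB, ratio: {fp16_gb / int4_gb:.1f}x")
```

The ~6 GB weight footprint is broadly consistent with the ~7 GB measured in the performance test below once quantization metadata and runtime buffers are included.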
|
|
|
The GPTQ paper tackles the layer-wise compression problem: |
|
|
|
Given a layer \\(l\\) with weight matrix \\(W_{l}\\) and layer input \\(X_{l}\\), we want to find a quantized version of the weight \\(\hat{W}_{l}\\) to minimize the mean squared error (MSE): |
|
|
|
|
|
\\({\hat{W}_{l}}^{*} = \mathrm{argmin}_{\hat{W}_{l}} \|W_{l}X_{l}-\hat{W}_{l}X_{l}\|^{2}_{2}\\)
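
This objective is straightforward to write down in PyTorch; the snippet below uses naive round-to-nearest as a stand-in quantizer (an assumption for illustration — GPTQ's contribution is how it searches for \\(\hat{W}_{l}\\), not the objective itself):

```py
import torch

def layerwise_mse(W, W_hat, X):
    """Squared error between the full-precision and quantized layer outputs."""
    return torch.sum((W @ X - W_hat @ X) ** 2)

W = torch.randn(128, 256)              # weight matrix W_l
X = torch.randn(256, 32)               # calibration inputs X_l (features x samples)
scale = W.abs().max() / 7              # symmetric int4 grid covering [-8, 7]
W_hat = torch.clamp(torch.round(W / scale), -8, 7) * scale
print(layerwise_mse(W, W_hat, X))
```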
|
|
|
Once this is solved per layer, a solution to the global problem can be obtained by combining the layer-wise solutions. |
|
|
|
In order to solve this layer-wise compression problem, the authors use the Optimal Brain Quantization (OBQ) framework ([Frantar et al., 2022](https://arxiv.org/abs/2208.11580)). The OBQ method starts from the observation that the above equation can be written as the sum of squared errors over each row of \\(W_{l}\\):
|
|
|
|
|
\\( \sum_{i=0}^{d_{row}} \|W_{l[i,:]}X_{l}-\hat{W}_{l[i,:]}X_{l}\|^{2}_{2} \\)
|
|
|
This means that we can quantize each row independently. This is called per-channel quantization. For each row \\(W_{l[i,:]}\\), OBQ quantizes one weight at a time while always updating all not-yet-quantized weights, in order to compensate for the error incurred by quantizing a single weight. The update on selected weights has a closed-form formula, utilizing Hessian matrices. |
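
The per-row independence is easy to verify numerically: the total layer error is exactly the sum of the independent per-row errors, which is what lets OBQ (and GPTQ) process rows in isolation. The tensors below are random placeholders:

```py
import torch

W = torch.randn(64, 128)
W_hat = W + 0.01 * torch.randn_like(W)   # stand-in for a quantized weight matrix
X = torch.randn(128, 16)

total = torch.sum((W @ X - W_hat @ X) ** 2)
per_row = sum(torch.sum((W[i] @ X - W_hat[i] @ X) ** 2) for i in range(W.shape[0]))
assert torch.allclose(total, per_row, rtol=1e-4)
```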
|
|
|
The GPTQ paper improves this framework by introducing a set of optimizations that reduces the complexity of the quantization algorithm while retaining the accuracy of the model. |
|
|
|
Compared to OBQ, the quantization step itself is also faster with GPTQ: it takes 2 GPU-hours to quantize a BERT model (336M) with OBQ, whereas with GPTQ, a Bloom model (176B) can be quantized in less than 4 GPU-hours. |
|
|
|
To learn more about the exact algorithm and the different benchmarks on perplexity and speedups, check out the original [paper](https://arxiv.org/pdf/2210.17323.pdf). |
|
|
|
## Example of Usage |
|
|
|
```sh |
|
pip install transformers accelerate optimum |
|
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/ |
|
``` |
|
|
|
```py |
|
import torch |
|
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig |
|
|
|
model_id = "clibrain/Llama-2-13b-ft-instruct-es-gptq-4bit" |
|
|
|
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto") |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
def create_instruction(instruction, input_data=None, context=None): |
|
sections = { |
|
"Instrucci贸n": instruction, |
|
"Entrada": input_data, |
|
"Contexto": context, |
|
} |
|
|
|
system_prompt = "A continuaci贸n hay una instrucci贸n que describe una tarea, junto con una entrada que proporciona m谩s contexto. Escriba una respuesta que complete adecuadamente la solicitud.\n\n" |
|
prompt = system_prompt |
|
|
|
for title, content in sections.items(): |
|
if content is not None: |
|
prompt += f"### {title}:\n{content}\n\n" |
|
|
|
prompt += "### Respuesta:\n" |
|
|
|
return prompt |
|
|
|
|
|
def generate( |
|
instruction, |
|
input=None, |
|
context=None, |
|
max_new_tokens=128, |
|
temperature=0.1, |
|
top_p=0.75, |
|
top_k=40, |
|
num_beams=4, |
|
**kwargs |
|
): |
|
|
|
prompt = create_instruction(instruction, input, context) |
|
print(prompt.replace("### Respuesta:\n", "")) |
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
input_ids = inputs["input_ids"].to("cuda") |
|
attention_mask = inputs["attention_mask"].to("cuda") |
|
generation_config = GenerationConfig( |
|
temperature=temperature, |
|
top_p=top_p, |
|
top_k=top_k, |
|
num_beams=num_beams, |
|
**kwargs, |
|
) |
|
with torch.no_grad(): |
|
generation_output = model.generate( |
|
input_ids=input_ids, |
|
attention_mask=attention_mask, |
|
generation_config=generation_config, |
|
return_dict_in_generate=True, |
|
output_scores=True, |
|
max_new_tokens=max_new_tokens, |
|
early_stopping=True |
|
) |
|
s = generation_output.sequences[0] |
|
output = tokenizer.decode(s) |
|
return output.split("### Respuesta:")[1].lstrip("\n") |
|
|
|
instruction = "Dame una lista de lugares a visitar en Espa帽a." |
|
print(generate(instruction)) |
|
``` |
|
|
|
### Performance Test |
|
|
|
After several runs on an *NVIDIA T4 with 16 GB of VRAM*, we measured:
|
|
|
| Latency | GPU Memory Required |
|---------|---------------------|
| 49.39 ms/token | 7.06 GB |
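
A minimal sketch of how figures like these can be reproduced, reusing the `model` and `tokenizer` loaded in the usage example above (the prompt and token budget are arbitrary choices, and `torch.cuda.max_memory_allocated` reports allocator usage, which can differ slightly from what `nvidia-smi` shows):

```py
import time
import torch

prompt = "Dame una lista de lugares a visitar en España."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{1000 * elapsed / new_tokens:.2f} ms/token")
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.2f} GB peak GPU memory")
```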