Abstract
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
Community
Super exciting paper!
With large embedding models driving AI development, techniques like this will play a major role in making it feasible to train much larger models and in opening up new research frontiers. Very exciting!
Congratulations on this excellent paper. It not only presents the results of the study but is also very informative. Thank you.
Thank you, great read!
Here are my main takeaways:
- Innovations:
- 4-bit NormalFloat (NF4), a new datatype that is information theoretically optimal for normally distributed weights. This is used only for storage: the computation data type is still bf16, so for the forward and backward pass you de-quantize the storage data type.
- Double quantization by quantizing the quantization constants: when quantizing, you need to rescale your values by a constant C to make them fit into a certain range. Double quantization quantizes C itself, saving on average about 0.37 bits per parameter, which is quite significant! (The sketch after this list works through the numbers.)
- Paged Optimizers to manage memory spikes: NVIDIA unified memory (automatic transfers between GPU and CPU) pages optimizer states out to CPU RAM, avoiding the memory spikes from gradient checkpointing that occur when processing a mini-batch with a long sequence length.
- Effect:
- On compute: the memory cost is greatly reduced, at the cost of a small computational overhead.
- On model accuracy: no degradation of performance.
- About bf16: this data type is brain float16, introduced by Google Brain. It allocates more bits to the exponent and fewer to the mantissa than fp16, giving fp32-level dynamic range at the size of fp16.
- Hyperparameters used for finetuning experiments (see the config sketch after this list):
- "We find LoRA r is unrelated to final performance if LoRA is used on all layers"
- LR: 1e-4 or 2e-4, constant schedule.
- Batch size: 16 for models under 13B, 16 or 32 for 33B, 16-64 for 65B
- NF4 with double quantization and bf16 computation datatype.
- LoRA r = 64, α = 16
- We also use Adam beta2 of 0.999, max grad norm of 0.3 and LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B models.
- Target modules: "all linear layers of the base model"
- "use group-by-length to group examples of similar lengths in the same batch (note this will produce an oscillating loss curve)"
- Question: the paper says "We find that LoRA dropout 0.05 is useful for small models (7B, 13B), but not for larger models (33B, 65B)." Then why use the opposite in the finetuning experiments?
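
To make the takeaways above concrete, here is a minimal sketch of a QLoRA-style setup on the Hugging Face transformers / peft / bitsandbytes stack, assuming the hyperparameters listed above. The base checkpoint, output directory, batch size, and the LLaMA-style module names (one way to cover "all linear layers") are illustrative choices, not values prescribed by the paper:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Storage is 4-bit NF4 with double quantization; bf16 is the compute dtype, so the
# weights are de-quantized on the fly for the forward and backward passes.
# Double quantization: the fp32 quantization constants (one per block of 64 weights,
# i.e. 32/64 = 0.5 bits/param) are themselves quantized to 8 bits with blocksize 256,
# leaving 8/64 + 32/(64*256) ≈ 0.127 bits/param, a saving of about 0.37 bits/param.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The checkpoint name is a placeholder; any causal LM supported by bitsandbytes works.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # freezes base weights, enables gradient checkpointing

# LoRA on all linear layers of the base model, r = 64, alpha = 16.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,  # 0.1 for models up to 13B, 0.05 for 33B/65B per the hyperparameter table
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Training hyperparameters from the takeaways above: constant LR schedule,
# max grad norm 0.3, Adam beta2 0.999, group-by-length batching, paged optimizer.
training_args = TrainingArguments(
    output_dir="qlora-finetune",   # placeholder
    per_device_train_batch_size=16,
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    max_grad_norm=0.3,
    adam_beta2=0.999,
    group_by_length=True,          # expect an oscillating loss curve
    bf16=True,
    optim="paged_adamw_32bit",     # paged AdamW to absorb memory spikes
)
```

From here, model and training_args would be handed to a Trainer (or TRL's SFTTrainer) together with an instruction dataset; the paged optimizer keeps optimizer states in unified memory, so memory spikes from long sequences spill over to CPU RAM instead of causing out-of-memory errors.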