LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Abstract
Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together to a pre-trained model. In such cases it is common to observe a consistent gap in downstream-task performance between full fine-tuning and the quantization-plus-LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision models and significantly improves generalization on downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed-precision regimes. We will release our code.
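The core of LoftQ is an alternating procedure: quantize the part of the weight that the low-rank adapter cannot capture, then refit the adapter to the remaining quantization error via a truncated SVD. Below is a minimal sketch of that loop, not the released implementation: the uniform fake-quantizer, the rank, and the number of alternating steps are illustrative stand-ins (the paper uses NF4/NF2-style quantizers).

```python
# Minimal sketch of a LoftQ-style initialization loop: alternate between
# quantizing the residual weight and refitting a rank-r factorization so that
# Q + A @ B.T stays close to the full-precision weight W.
# The uniform quantizer is a placeholder for the NF4/NF2 quantizers in the paper;
# rank and number of steps are illustrative choices.
import torch


def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulated uniform quantize-dequantize of a weight matrix (placeholder)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale


def loftq_init(w: torch.Tensor, rank: int = 16, bits: int = 4, steps: int = 5):
    """Return a quantized backbone Q and LoRA factors (A, B) initialized so that
    Q + A @ B.T approximates the full-precision weight w."""
    a = torch.zeros(w.size(0), rank)
    b = torch.zeros(w.size(1), rank)
    for _ in range(steps):
        # Quantize only what the current low-rank adapter cannot express.
        q = fake_quantize(w - a @ b.T, bits)
        # Refit the adapter to the remaining quantization error via SVD.
        u, s, vh = torch.linalg.svd(w - q, full_matrices=False)
        a = u[:, :rank] * s[:rank]
        b = vh[:rank, :].T
    return q, a, b


w = torch.randn(768, 768)
q, a, b = loftq_init(w)
print(torch.norm(w - (q + a @ b.T)))  # reconstruction error of the LoftQ-style init
```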
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (2023)
- NOLA: Networks as Linear Combination of Low Rank Random Basis (2023)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (2023)
- LoRA ensembles for large language model fine-tuning (2023)
- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (2023)
If I understand this paper correctly, we can do quantized fine-tuning with LoftQ, but the quantized weights we obtain will always be specific to the dataset that was used for the fine-tuning. We cannot train an adapter a1 on a frozen model M and later train an adapter a2 on (M+a1), both frozen in, let's say, 8-bit or 4-bit, right?
LoftQ is not task-specific. You can fine-tune on any dataset with the same quantized model M and initial adapter a0. Your requirement of training an adapter a2 on (M+a1) is definitely feasible.
Moreover, LoftQ supports multi-task learning, which is the original motivation of LoRA. With, again, the same quantized model M and initial adapter a0, you can obtain many adapters a1, a2, ..., an from different datasets and plug each of them into the same quantized model M for deployment (see the sketch below).
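A hedged sketch of this multi-adapter workflow with the PEFT library is shown here; the checkpoint name and adapter paths are placeholders I chose for illustration, not artifacts shipped with the paper.

```python
# Sketch: one shared LoftQ-quantized backbone M, several independently trained
# LoRA adapters swapped in at inference time. The repository name and adapter
# paths are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Shared backbone with the LoftQ initialization (hypothetical checkpoint name).
base = AutoModelForCausalLM.from_pretrained("LoftQ/Llama-2-7b-hf-4bit-64rank")

# Attach adapter a1, then register a2 trained on a different dataset;
# both reuse the same frozen quantized weights.
model = PeftModel.from_pretrained(base, "path/to/adapter_a1", adapter_name="a1")
model.load_adapter("path/to/adapter_a2", adapter_name="a2")

model.set_adapter("a1")  # route inference through adapter a1
model.set_adapter("a2")  # switch adapters without reloading the backbone
```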
Hi Team,
Can LoftQ be used for vision foundation models like OWL-ViT v2 and Grounding DINO?
Reference code for this would be helpful.
Thanks