BitDelta: Your Fine-Tune May Only Be Worth One Bit
Abstract
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it is intuitive to assume that fine-tuning adds less new information to the model and is therefore more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This finding not only highlights the potential redundancy of the information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling a single high-precision base model to be shared across multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10x, which also translates into improved generation latency in multi-tenant settings. We validate BitDelta through experiments across the Llama-2 and Mistral model families, on models of up to 70B parameters, showing minimal performance degradation across all tested settings.
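To make the idea concrete, here is a minimal sketch of 1-bit delta compression for a single weight matrix, assuming a boolean sign mask plus one high-precision scale initialized to the mean absolute delta (the L2-optimal scale for a fixed sign mask; the paper further calibrates the scales, and the function names here are illustrative):

```python
import torch

def bitdelta_compress(w_base: torch.Tensor, w_fine: torch.Tensor):
    """Compress the fine-tuning delta of one weight matrix to ~1 bit per parameter.

    Returns a boolean sign mask (packable to 1 bit/parameter) and a single
    high-precision scale, initialized here to the mean absolute delta.
    """
    delta = w_fine - w_base
    sign = delta >= 0
    scale = delta.abs().mean()
    return sign, scale

def bitdelta_decompress(w_base: torch.Tensor, sign: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Approximate the fine-tuned weights from the base weights plus the 1-bit delta."""
    return w_base + scale * (2.0 * sign.to(w_base.dtype) - 1.0)

# Toy usage: a random matrix standing in for one linear layer's weights.
w_base = torch.randn(1024, 1024)
w_fine = w_base + 0.01 * torch.randn(1024, 1024)   # a small fine-tuning delta
sign, scale = bitdelta_compress(w_base, w_fine)
w_hat = bitdelta_decompress(w_base, sign, scale)
print("mean |error|:", (w_hat - w_fine).abs().mean().item())
```

In practice the sign mask would be bit-packed and the base-plus-delta reconstruction fused into the matmul kernel; the sketch keeps a plain boolean tensor for clarity.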
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this one.
The following papers were recommended by the Semantic Scholar API:
- L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ (2024)
- ApiQ: Finetuning of 2-Bit Quantized Large Language Model (2024)
- APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference (2024)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (2024)
- LQER: Low-Rank Quantization Error Reconstruction for LLMs (2024)
So a 7B model would take 28GB of VRAM for the base model plus 700MB for the delta. Great for data centers, but for consumers, QLoRA, GPTQ, etc. are far more efficient. I think most ordinary people will be happy to sacrifice a minor drop in accuracy for a huge drop in VRAM requirements.
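For scale, a rough back-of-the-envelope calculator for the multi-tenant serving case (a sketch only; `serving_memory_gb` and its defaults are illustrative, and the exact figures depend on the base precision, which layers receive deltas, and the per-matrix scales):

```python
def serving_memory_gb(n_params: float, n_finetunes: int,
                      base_bytes_per_param: float = 2.0) -> tuple[float, float]:
    """Compare naive multi-tenant serving (every fine-tune stored in full) with a
    shared base model plus one ~1-bit delta per fine-tune. Illustrative only."""
    GB = 1024 ** 3
    base = n_params * base_bytes_per_param / GB      # one high-precision base copy
    delta = n_params / 8 / GB                        # ~1 bit per parameter per fine-tune
    return n_finetunes * base, base + n_finetunes * delta

naive, shared = serving_memory_gb(7e9, n_finetunes=8)   # FP16 base by default
print(f"8 full FP16 fine-tunes: ~{naive:.0f} GB; shared base + 8 deltas: ~{shared:.0f} GB")
```

The single-model consumer case is indeed where the savings are smallest; the shared base amortizes only when several fine-tunes are served together.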
Can you combine this idea with this paper? https://huggingface.co/papers/2312.15166
You could probably compress that model quite a bit, since it has 24 duplicated layers with (presumably) only a small change between the two copies.
If it does work, the idea could be applied to larger models without making them too expensive to run.
@timothelaborie Sounds interesting, will take a look! For a model with 32 layers, the 16 extra layers in the depth up-scaled model can be represented as 1-bit deltas. The main concern would be if they use a lot of data for continued pre-training: BitDelta tends to fail if the weight delta is too large.
E.g., this happened when we tried to compress the Mixtral experts, which are hypothesized to be continually pre-trained from Mistral 7B.
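If the duplicated layers of a depth-up-scaled model map cleanly back to the layers they were copied from, the same machinery could be reused. A hypothetical sketch, reusing `bitdelta_compress` from the earlier example (the layer mapping and the one-matrix-per-layer simplification are illustrative, not the actual up-scaling recipe):

```python
# Hypothetical mapping from duplicated-layer index to the source layer it was
# copied from; the real pattern depends on the depth up-scaling recipe.
duplicate_to_source = {32: 16, 33: 17, 34: 18}

def compress_duplicated_layers(layer_weights, mapping):
    """layer_weights: one weight tensor per layer (simplified). Returns a 1-bit
    delta for every duplicated layer, relative to the layer it was copied from."""
    return {dup: bitdelta_compress(layer_weights[src], layer_weights[dup])
            for dup, src in mapping.items()}
```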