arxiv:2402.10193

BitDelta: Your Fine-Tune May Only Be Worth One Bit

Published on Feb 15 · Submitted by akhaliq on Feb 16

Abstract

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which can also be translated to enhanced generation latency in multi-tenant settings. We validate BitDelta through experiments across Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
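To make the decomposition concrete, here is a minimal sketch of the 1-bit delta idea in PyTorch. The function names and toy matrices are illustrative rather than taken from the paper's released code, and the distillation step BitDelta uses to calibrate the per-matrix scales is omitted:

```python
# Minimal sketch of 1-bit delta compression, assuming a PyTorch setting.
# BitDelta additionally calibrates the per-matrix scales via distillation,
# which is omitted here; names like `binarize_delta` are illustrative.
import torch

def binarize_delta(base_weight: torch.Tensor, finetuned_weight: torch.Tensor):
    """Compress W_ft - W_base into a sign matrix plus a single scale."""
    delta = finetuned_weight - base_weight
    scale = delta.abs().mean()   # per-matrix scale, initialized as mean |delta|
    sign = torch.sign(delta)     # +1 / -1 entries: 1 bit per parameter once packed
    return sign, scale

def apply_delta(base_weight: torch.Tensor, sign: torch.Tensor, scale: torch.Tensor):
    """Reconstruct an approximation of the fine-tuned weight on the fly."""
    return base_weight + scale * sign

# Toy usage on a single weight matrix.
torch.manual_seed(0)
w_base = torch.randn(1024, 1024)
w_ft = w_base + 0.01 * torch.randn(1024, 1024)  # pretend fine-tuning delta
sign, scale = binarize_delta(w_base, w_ft)
w_hat = apply_delta(w_base, sign, scale)
print("mean reconstruction error:", (w_hat - w_ft).abs().mean().item())
```

In multi-tenant serving, a single high-precision copy of the base weights stays in GPU memory and each fine-tune contributes only its packed sign matrices and scales, which is where the more than 10x memory reduction quoted above comes from.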

Community


So a 7B model would take roughly 14 GB of VRAM for the FP16 base model plus under 1 GB for each 1-bit delta. Great for data centers, but for consumers QLoRA, GPTQ, etc. are far more efficient. I think most ordinary people will happily accept a minor drop in accuracy in exchange for a huge drop in VRAM requirements.
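A quick back-of-the-envelope check of that memory math (parameter storage only; activations, the KV cache, and the small per-matrix FP16 scales are ignored, and the tenant count is a made-up example):

```python
# Rough memory math for serving N fine-tunes of a 7B model
# (parameter storage only; activations and KV cache are ignored).
GB = 1e9
N_PARAMS = 7e9
N_FINETUNES = 8  # hypothetical number of tenants

fp16_model = N_PARAMS * 2 / GB        # 2 bytes per parameter
one_bit_delta = N_PARAMS / 8 / GB     # 1 bit per parameter, packed

naive = N_FINETUNES * fp16_model                     # one full FP16 copy per tenant
bitdelta = fp16_model + N_FINETUNES * one_bit_delta  # one base + N 1-bit deltas

print(f"FP16 base model:      {fp16_model:.1f} GB")
print(f"single 1-bit delta:   {one_bit_delta:.2f} GB")
print(f"{N_FINETUNES} full fine-tunes:     {naive:.1f} GB")
print(f"base + {N_FINETUNES} 1-bit deltas: {bitdelta:.1f} GB")
```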

Can you combine this idea with this paper? https://huggingface.co/papers/2312.15166
You could probably compress that model quite a bit, since it has 24 duplicated layers with (presumably) only a small change between the two copies.

If it does work, the idea could be applied to larger models without making them too expensive to run.

Paper author · edited Feb 18

@timothelaborie Sounds interesting, will take a look! For a model with 32 layers, the 16 extra layers in the depth up-scaled model can be represented as 1-bit deltas. The main concern would be if they use a lot of data for continued pre-training, since BitDelta tends to fail if the weight delta is too large.

E.g., this happened when we tried to compress the Mixtral experts, which are hypothesized to have been obtained by continued pre-training from Mistral 7B.
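A hypothetical sketch of how that could look for a SOLAR-style depth up-scaled model: each duplicated layer is stored as a 1-bit delta against the layer it was copied from. The layer pairing, function names, and data layout below are made up for illustration, and, as noted above, this only helps if continued pre-training keeps those deltas small:

```python
# Hypothetical sketch: store duplicated layers of a depth up-scaled model as
# 1-bit deltas against their source layers. The `duplicate_of` mapping is
# illustrative, not taken from the SOLAR paper.
import torch

def binarize(delta: torch.Tensor):
    scale = delta.abs().mean()
    return torch.sign(delta).to(torch.int8), scale

def compress_duplicated_layers(layers, duplicate_of):
    """
    layers: list of dicts mapping parameter name -> weight tensor, one per layer.
    duplicate_of: dict mapping duplicated-layer index -> source-layer index.
    Returns a 1-bit delta (sign + scale) for every parameter of each duplicate.
    """
    compressed = {}
    for dup_idx, src_idx in duplicate_of.items():
        for name, w_dup in layers[dup_idx].items():
            w_src = layers[src_idx][name]
            compressed[(dup_idx, name)] = binarize(w_dup - w_src)
    return compressed

# Toy example: layer 1 is a lightly perturbed duplicate of layer 0.
layers = [{"mlp.w": torch.randn(256, 256)} for _ in range(2)]
layers[1]["mlp.w"] = layers[0]["mlp.w"] + 0.01 * torch.randn(256, 256)
compressed = compress_duplicated_layers(layers, duplicate_of={1: 0})
print({k: (v[0].dtype, float(v[1])) for k, v in compressed.items()})
```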
