arxiv:2310.11453

BitNet: Scaling 1-bit Transformers for Large Language Models

Published on Oct 17, 2023
· Submitted by akhaliq on Oct 18, 2023
#1 Paper of the day
Abstract

The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
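
For readers who want to see the core idea in code, here is a minimal PyTorch sketch of a BitLinear-style layer, assuming the standard straight-through-estimator trick for training through the binarization. This is an illustration rather than the authors' implementation: the paper additionally quantizes activations, applies normalization, and uses specific scaling factors, all of which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a 1-bit linear layer: latent full-precision weights
    are binarized to {-1, +1} on the forward pass; a straight-through
    estimator lets gradients flow back to the latent weights."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight - self.weight.mean()  # zero-center the weights
        alpha = w.abs().mean()                # scale to match magnitude
        w_bin = torch.sign(w) * alpha         # binarize to +/- alpha
        # Straight-through estimator: forward uses w_bin, backward
        # treats the binarization as the identity function.
        w_ste = w + (w_bin - w).detach()
        return F.linear(x, w_ste, self.bias)

# Drop-in usage in place of nn.Linear:
layer = BitLinear(512, 512)
y = layer(torch.randn(8, 512))
```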

Community

Holy. Mother. Of. God.

This changes everything.

If this scales, we are looking at 180B models on a 3090

Or a 40B model on an iPhone

What's next? Multiple parameters per bit...? Sounds impossible, but we do it with JPG.

Can you simplify any further????

Normally we use a 32-bit datatype for each parameter. That precision allows for about 4 billion possible values per parameter.

They have now squeezed that down to 1 bit per parameter, which has two possible values (in BitNet's case, -1 or +1).

So it's 2 billion times less precise, and yet it retains almost all of the accuracy.

As back-of-envelope math: it is 32 times smaller, and up to 32 times faster and 32 times cheaper.

If it used to cost $320,000 to train, it now costs $10,000 to train.

If it used to require 32 GPUs, it now requires 1 GPU.
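
To put a number on the memory part of that claim (the speed and cost figures are rougher, since training still keeps higher-precision state), here is a quick back-of-envelope script; the 70B parameter count is just a hypothetical example:

```python
# Weight-memory footprint at different precisions for a
# hypothetical 70B-parameter model (weights only, no activations).
params = 70e9

footprints = {
    "fp32": params * 4,   # 4 bytes per weight
    "fp16": params * 2,   # 2 bytes per weight
    "int8": params * 1,   # 1 byte per weight
    "1-bit": params / 8,  # 8 weights packed per byte
}

for name, nbytes in footprints.items():
    print(f"{name:>5}: {nbytes / 2**30:7.1f} GiB")
# fp32: ~260.8, fp16: ~130.4, int8: ~65.2, 1-bit: ~8.2 GiB
```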

I really appreciate the explanation! It's very detailed and helpful. In that statement I was mostly referring to how you can get something with less precision than 1 bit per parameter. How low can the precision get!?! I would be kind of sad if 1 bit is the limit. I would assume there is some compression to make it even less.

Honestly it's way too technical for me; I categorize compression as "black magic".

But using statistics and heuristics they are able to compress data to less than a single bit per symbol. This is how JPG works. Whether it's possible for neural networks remains to be seen, but I wouldn't be surprised if someone cracks it.

JPG can do ~90% compression, so 0.1 bits per parameter would probably be the limit.

With that level of compression you could fit GPT-4 on a consumer GPU.
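
To make "less than one bit per parameter" concrete: if the weight values aren't uniformly distributed, an entropy coder (the same idea behind JPEG's compression stage) can store them in fewer bits on average than a fixed-width code. A small sketch; the ternary distribution here is a made-up example, not from the paper:

```python
import math

# Hypothetical weight distribution: mostly zeros, few +/-1s.
# Shannon entropy is the average bits/weight an ideal entropy
# coder (e.g. arithmetic coding) could achieve.
probs = {-1: 0.1, 0: 0.8, +1: 0.1}

entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
print(f"{entropy:.3f} bits per weight")  # ~0.922, already < 1 bit
```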

Wow, this changes everything!! Well-written paper. GPT-4+ level performance and giant models using a regular CPU and RAM. Training costs 40-50x less, opening up a whole new paradigm. What a time to be alive! Excited to see the open-source community adopt it; we could be seeing quantised models as early as next week.

Surprising to see MSFT Research is involved, as this could jeopardize their/OpenAI's business model and control of AI safety! Where is the source code? We should safeguard it from controlling (government) hands.

This is incredibly insane.


We need an ablation study on the increase in time complexity due to on-the-fly quantisation and de-quantisation.

Can someone explain to me why a GeLU activation is used right after a BitLinear layer? Wouldn't both the input and the weights be quantized? How does a ReLU / GeLU non-linearity even affect a layer with {-1, +1} weights?


I believe another overlooked benefit of these -1, 0, or 1 valued weights is that Google's AlphaTensor has found improved methods of multiplying such matrices together. Depending on the matrix sizes, that means 2 fewer steps (if I remember correctly, the best algorithm prior to this discovery took 49 steps vs. AlphaTensor's 47), and for some matrix sizes as many as 4 fewer steps (4% to 8% fewer multiplications). Combined with the actual multiplications not being floating-point calculations, I'm sure there is an expected speed-up here.
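
On the "not floating-point calculations" point: with weights restricted to {-1, 0, +1}, a matrix-vector product needs no multiplications at all, only additions and subtractions of the inputs. A NumPy illustration (mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))  # ternary weights in {-1, 0, +1}
x = rng.standard_normal(8)

y_ref = W @ x  # ordinary matmul, for reference

# Multiplication-free: add inputs where W == +1, subtract where W == -1.
y = np.where(W == 1, x, 0.0).sum(axis=1) - np.where(W == -1, x, 0.0).sum(axis=1)

assert np.allclose(y, y_ref)
```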

Doesn't it look like analog-to-digital conversion of the weights?
And the quantization reminds me of the sampling theorem.
And yes, it resembles a lot of what Claude Shannon said!!


