Abstract
The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
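As a rough illustration of what such a drop-in replacement might look like, here is a minimal PyTorch sketch based on the paper's description (sign-binarized weights with a scaling factor β, absmax-quantized activations with a scale γ, and a straight-through estimator so the 1-bit weights can be trained from scratch). The class and argument names are my own, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    """Minimal sketch of a 1-bit linear layer in the spirit of BitLinear.
    Simplified: the paper also applies LayerNorm (SubLN) before quantization
    and uses group-wise scales for parallel training; both are omitted here."""

    def __init__(self, in_features: int, out_features: int, act_bits: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features).normal_(std=0.02))
        self.q_max = 2 ** (act_bits - 1) - 1  # 127 for 8-bit activations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Binarize zero-centered weights to {-1, +1}; beta keeps their average magnitude.
        beta = w.abs().mean()
        w_bin = torch.where(w - w.mean() >= 0, torch.ones_like(w), -torch.ones_like(w))
        w_q = w + (w_bin - w).detach()            # straight-through estimator

        # Absmax-quantize activations into [-q_max, q_max]; gamma is the scale.
        gamma = x.abs().max().clamp(min=1e-5)
        x_scaled = torch.clamp(x * self.q_max / gamma, -self.q_max, self.q_max)
        x_q = x + (x_scaled - x).detach()         # straight-through estimator

        # Low-bit matmul, then rescale the output back to real values.
        return F.linear(x_q, w_q) * beta * gamma / self.q_max
```

Swapping a layer like this in for nn.Linear inside a standard Transformer block is, roughly, the architectural change the abstract describes.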
Community
Holy. Mother. Of. God.
This changes everything.
If this scales we are looking at 180B models on a 3090
Or a 40B model on an iPhone
What's next? Multiple parameters per bit...? Sounds impossible, but we do it with JPG.
Can you simplify any further????
Normally we use a 32-bit datatype for each parameter. That precision allows for about 4 billion possible values per parameter.
They have now squeezed that down to 1 bit per parameter, which has two possible values (-1 or +1).
So it's roughly 2 billion times less precise, and yet it retains nearly all of the accuracy.
It is 32 times smaller, 32 times faster, 32 times cheaper.
If it used to cost $320,000 to train, it now costs $10,000 to train.
If it used to require 32 GPUs, it now requires 1 GPU.
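To make the "smaller" part concrete, a quick back-of-the-envelope calculation (the model size and the numbers below are my own illustration, not from the paper):

```python
# Rough memory footprint of the weights alone at different precisions
# (illustrative arithmetic only; ignores activations, KV cache, and overhead).
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    return n_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 1):
    print(f"70B params @ {bits:>2}-bit: {weight_memory_gb(70e9, bits):.1f} GB")

# 70B params @ 32-bit: 280.0 GB
# 70B params @ 16-bit: 140.0 GB
# 70B params @  8-bit: 70.0 GB
# 70B params @  1-bit: 8.8 GB
```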
I really appreciate the explanation! It’s very detailed and helpful. In that statement I was mostly referring to how can you get something with less precision than 1 bit per parameter. How low precision can it get!?! I would be kind of sad if 1bit is the limit. I would assume there is some compression to make it even less.
Honestly it's way too technical for me; I categorize compression as "black magic".
But using statistics and heuristics they are able to compress data to less than a single bit. This is how JPG works. Whether it's possible for neural networks remains to be seen, but I wouldn't be surprised if someone cracks it.
JPG can do about 90% compression, so 0.1 bits per parameter would probably be the limit
With that level of compression you could fit GPT4 on a consumer GPU
Wow, this changes everything!! Well-written paper. GPT-4+ level performance and giant models on a regular CPU and RAM. Training costs 40-50x less, opening up a whole new paradigm. What a time to be alive! Excited to see the open-source community adopt it; we could be seeing quantised models as early as next week...
Surprising to see MSFT Research is involved, as this could jeopardize their/OpenAI's business models and control of AI safety! Where is the source code? We should safeguard it from controlling (government) hands.
This is incredibly insane.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Understanding the Impact of Post-Training Quantization on Large Language Models (2023)
- Towards End-to-end 4-Bit Inference on Generative Large Language Models (2023)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (2023)
- PB-LLM: Partially Binarized Large Language Models (2023)
- TEQ: Trainable Equivalent Transformation for Quantization of LLMs (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
Need an ablation study on the increase in time complexity due to on-the-fly quantisation and de-quantisation.
Can someone explain to me why a GeLU activation is used right after a BitLinear layer? Wouldn't both the input and the weights be quantized? How does a ReLU / GeLU non-linearity even affect a layer with {0, 1} output?
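My understanding (a sketch of the idea, not the paper's code): the weights are binarized to ±1 and the activations are absmax-quantized, but the matmul output is rescaled back to real values with the scaling factors, so the GeLU sees a continuous range rather than {0, 1}. A toy demonstration with assumed scale values:

```python
import torch

# Even with +/-1 weights and integer-quantized inputs, the accumulated matmul
# output spans many real values once rescaled, so GeLU/ReLU still has
# something meaningful to bend.
torch.manual_seed(0)
w_bin = torch.sign(torch.randn(8, 16))
w_bin[w_bin == 0] = 1.0                              # strictly {-1, +1} weights
x_q = torch.randint(-127, 128, (4, 16)).float()      # 8-bit-style activations
beta, gamma, q_max = 0.02, 3.0, 127                  # assumed scale factors
y = (x_q @ w_bin.T) * beta * gamma / q_max           # dequantized, real-valued output
print(y.min().item(), y.max().item())
```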
I believe another overlooked benefit of these -1, 0 or +1 valued weights is that Google's AlphaTensor has found improved methods of multiplying such matrices. Depending on the matrix sizes, that means 2 fewer steps (if I remember correctly, the best algorithm prior to this discovery took 49 multiplications vs AlphaTensor's 47) and, for some sizes, as many as 4 fewer steps (4% to 8% fewer multiplications). Combined with the actual multiplications not being floating-point calculations, I'm sure there is an expected speed-up here.
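To illustrate the "no floating-point multiplications" point: with weights restricted to {-1, 0, +1}, a matrix-vector product reduces to additions and subtractions. A toy sketch (my own illustration, not the paper's kernel):

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product for weights in {-1, 0, +1} using only adds/subtracts."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # no multiplications
    return out

w = np.random.choice([-1, 0, 1], size=(4, 8)).astype(np.int8)
x = np.random.randn(8).astype(np.float32)
assert np.allclose(ternary_matvec(w, x), w.astype(np.float32) @ x, atol=1e-5)
```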
Doesn't it look like analog-to-digital conversion of weights?
And the quantization reminds me of the sampling theorem.
And yes, it resembles a lot of what Claude Shannon said!!
BitNet: Energy-Efficient 1-bit Transformers for Large Language Models!