FlatQuant: Flatness Matters for LLM Quantization
Abstract
Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with the equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still remain steep and outspread. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices and fuse all operations in FlatQuant into a single kernel. Extensive experiments show that FlatQuant establishes a new state-of-the-art benchmark for quantization. For instance, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. For inference latency, FlatQuant reduces the slowdown induced by pre-quantization transformation from 0.26x with QuaRot to merely 0.07x, bringing up to 2.3x speedup for prefill and 1.7x speedup for decoding. Code is available at: https://github.com/ruikangliu/FlatQuant.
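As a concrete illustration of the Kronecker decomposition mentioned above, the PyTorch sketch below applies a transform P = P1 ⊗ P2 to activations via a reshape and two small matrix multiplications, instead of materializing the full matrix, and checks the result against an explicit `torch.kron`. The function name `kron_apply`, the 64×64 factor sizes, and the random (uncalibrated) P1, P2 are illustrative assumptions, not the paper's implementation.

```python
import torch

def kron_apply(x: torch.Tensor, P1: torch.Tensor, P2: torch.Tensor) -> torch.Tensor:
    """Compute x @ kron(P1, P2) without materializing the (n1*n2, n1*n2) matrix.

    x : (..., n1 * n2) activations, P1 : (n1, n1), P2 : (n2, n2).
    Identity used: x @ (P1 ⊗ P2) == flatten(P1^T @ X @ P2), with X = x reshaped to (n1, n2).
    """
    n1, n2 = P1.shape[0], P2.shape[0]
    X = x.reshape(*x.shape[:-1], n1, n2)
    out = torch.einsum("...ik,ij,kl->...jl", X, P1, P2)  # two small matmuls instead of one huge one
    return out.reshape(*x.shape[:-1], n1 * n2)

# Illustrative sizes (an assumption): a 4096-dim hidden state factored as 64 x 64.
n1 = n2 = 64
P1, P2 = torch.randn(n1, n1), torch.randn(n2, n2)   # random stand-ins for calibrated transforms
x = torch.randn(8, n1 * n2)                          # a batch of token activations

fast = kron_apply(x, P1, P2)
ref = x @ torch.kron(P1, P2)                         # explicit Kronecker product for verification
assert torch.allclose(fast, ref, atol=1e-3, rtol=1e-4)
```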
Community
The contributions of this work are summarized below:
- We highlight the significance of achieving flatness for LLM quantization, demonstrating that flat distributions of weights and activations facilitate quantization and reduce error propagation across Transformer layers.
- We introduce FlatQuant, a new post-training quantization method with fast and learnable affine transformations optimized for each linear layer. The approach is empirically demonstrated to enhance the flatness of both weights and activations in LLMs.
- Extensive experiments demonstrate that FlatQuant sets new state-of-the-art results for quantization. To the best of our knowledge, we are the first to achieve ≤ 1% accuracy drop with simple round-to-nearest W4A4 quantization on the LLaMA-3-70B model (a minimal sketch of such round-to-nearest quantization follows this list).
- We design an efficient kernel that fuses the affine transformation and quantization, reducing the additional latency caused by the transformation from a 0.26x slowdown with QuaRot to only 0.07x. This gives up to 2.3x speedup for prefill and 1.7x speedup for decoding compared to the FP16 baseline.
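For reference, the round-to-nearest W4A4 setting in the list above corresponds to symmetric 4-bit fake quantization of activations (per token) and weights (per output channel) before the matrix multiplication. The sketch below is a minimal simulation under those assumptions; it omits FlatQuant's learned transforms, clipping, and calibration, and the tensor shapes are illustrative.

```python
import torch

def rtn_fake_quant(t: torch.Tensor, n_bits: int = 4, dim: int = -1) -> torch.Tensor:
    """Symmetric round-to-nearest fake quantization along `dim`:
    map to a signed integer grid, then dequantize back to float for simulation."""
    qmax = 2 ** (n_bits - 1) - 1                               # 7 for INT4
    scale = t.abs().amax(dim=dim, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(t / scale), -qmax - 1, qmax)   # integer grid [-8, 7]
    return q * scale

# W4A4 simulation of one linear layer (shapes are illustrative, not from the paper).
x = torch.randn(8, 4096)        # token activations, quantized per token (last dim)
w = torch.randn(11008, 4096)    # weight matrix, quantized per output channel (last dim)
y_quant = rtn_fake_quant(x) @ rtn_fake_quant(w).T
y_fp = x @ w.T
print("relative error:", (y_quant - y_fp).norm() / y_fp.norm())
```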
The code is available at https://github.com/ruikangliu/FlatQuant.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs (2024)
- Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference (2024)
- CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression (2024)
- MobileQuant: Mobile-friendly Quantization for On-device Language Models (2024)
- VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models (2024)