Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Abstract
The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup table to mitigate shared memory bandwidth constraints. At batch sizes < 32 and a quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, obtaining competitive quantization performance against strong baselines while achieving an end-to-end throughput increase of 1.5 to 2 times.
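For intuition only, below is a minimal PyTorch sketch of what a lookup-table dequantize-then-matmul computes logically for 4-bit weights with per-group scales. This is not the FLUTE kernel and not its actual weight layout; all tensor shapes, the function name, and the packing convention are hypothetical.

```python
import torch

def dequant_lut_matmul(x, packed, lut, scales, group_size=128):
    # Hypothetical layout, for illustration only:
    # x:      (batch, in_features) fp16 activations
    # packed: (in_features // 2, out_features) uint8, two 4-bit codes per byte
    # lut:    (16,) fp16 table mapping each 4-bit code to a value (e.g. NF4 levels)
    # scales: (in_features // group_size, out_features) fp16 per-group scales
    lo = (packed & 0x0F).long()                     # low nibble of each byte
    hi = (packed >> 4).long()                       # high nibble of each byte
    idx = torch.stack((lo, hi), dim=1).reshape(-1, packed.shape[1])   # (in, out)
    w = lut[idx].to(x.dtype)                        # table lookup -> dequantized codes
    # Apply per-group scales along the input dimension.
    w = w.reshape(scales.shape[0], group_size, -1) * scales[:, None, :].to(x.dtype)
    w = w.reshape(idx.shape[0], -1)
    return x @ w                                    # standard fp16 GEMM

# Example call with random data (shapes are arbitrary):
x = torch.randn(8, 4096, dtype=torch.float16)
packed = torch.randint(0, 256, (2048, 11008), dtype=torch.uint8)
lut = torch.randn(16, dtype=torch.float16)
scales = torch.rand(4096 // 128, 11008, dtype=torch.float16)
y = dequant_lut_matmul(x, packed, lut, scales)      # (8, 11008)
```

A real fused kernel never materializes the dequantized weight matrix in global memory: the table lookup and scaling happen in registers and shared memory immediately before the TensorCore MMA, which is what keeps the operation memory-bandwidth-bound on the packed weights rather than the full fp16 weights.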
Community
Introducing FLUTE, a CUDA kernel for non-uniformly quantized (via a lookup table) LLM inference. It accelerates QLoRA's NormalFloat (NF) out of the box, and more.
As an application, we extended NF4 and are releasing quantized models for LLaMA-3 (8B/70B) and Gemma-2 (9B/27B).
Highlights:
- Supports arbitrary mappings between quantized codes and dequantized values via a lookup table (INT4, FP4, NF4, etc.).
- Up to 2-3x faster than dense GEMM kernels.
- 1.3x to 2.6x end-to-end latency improvement (vLLM).
- Batch inference, 4-bit and 3-bit weights, and various group sizes.
- vLLM integration and 10+ pre-quantized models off-the-shelf.
And, for those who care:
- Almost entirely written in CUTLASS 3 (i.e., CuTe).
- Uses TensorCore, Async Copy, and Stream-K.
- Tailored to Ampere GPUs.
Paper: https://arxiv.org/abs/2407.10960
Code: https://github.com/HanGuo97/flute
@HanGuo97 @exists_forall @radi_cho @jrk @ericxing @yoonrkim
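Since FLUTE accepts an arbitrary table, here is a rough sketch of how a NormalFloat-style lookup table (in the spirit of QLoRA's NF4, which the paper extends) can be built from evenly spaced Gaussian quantiles. The helper name and construction are illustrative assumptions, not the paper's exact method: the real NF4 table is asymmetric and contains an exact zero.

```python
import torch

def normal_float_table(num_bits: int = 4) -> torch.Tensor:
    # Hypothetical helper: evenly spaced Gaussian quantiles, normalized to [-1, 1].
    n = 2 ** num_bits
    dist = torch.distributions.Normal(0.0, 1.0)
    probs = torch.linspace(0.5 / n, 1.0 - 0.5 / n, n)   # avoid the 0/1 endpoints
    levels = dist.icdf(probs)
    return levels / levels.abs().max()

table = normal_float_table(4)   # 16 values usable as a dequantization LUT
```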
Hi @radi-cho, congrats on this work!
I see you uploaded models to the Hub; it would be great to link them to this paper!
See here on how to do that: https://huggingface.co/docs/hub/en/model-cards#linking-a-paper
Let me know if you need any help!
Cheers,
Niels
Open-source @ HF
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge (2024)
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (2024)
- QQQ: Quality Quattuor-Bit Quantization for Large Language Models (2024)
- LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid (2024)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend