|
---
title: VPTQ Demo
emoji: π
colorFrom: blue
colorTo: green
sdk: static
license: mit
short_description: Vector Post-Training Quantization Inference Demo
---
|
|
|
Vector Post-Training Quantization (VPTQ) is a novel post-training quantization method that leverages vector quantization to achieve high accuracy on LLMs at extremely low bit-widths (<2-bit). VPTQ can compress 70B and even 405B models to 1-2 bits without retraining while maintaining high accuracy.
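To make the core idea concrete, here is a minimal NumPy sketch of plain vector quantization applied to a weight matrix: weights are split into short vectors, and each vector is replaced by the index of its nearest centroid in a codebook, so only the index table and the codebook need to be stored. The codebook size and vector length below are illustrative only; VPTQ itself builds on this idea with far more sophisticated codebook construction.

```python
import numpy as np

def vector_quantize(weights: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each weight vector to the index of its nearest codebook centroid.

    weights:  (num_vectors, vector_dim) weight matrix split into short vectors
    codebook: (num_centroids, vector_dim) centroids
    Storage per vector drops from vector_dim floats to one
    log2(num_centroids)-bit index.
    """
    # Squared Euclidean distance from every weight vector to every centroid
    dists = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights by codebook lookup."""
    return codebook[indices]

# Toy configuration: 256 centroids over 4-dim vectors = 8 bits / 4 weights
# = 2 bits per weight before codebook overhead. (A 65536-entry codebook over
# 8-dim vectors would likewise give 16 bits / 8 weights = 2 bits per weight.)
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 4)).astype(np.float32)
weights = rng.normal(size=(1024, 4)).astype(np.float32)
indices = vector_quantize(weights, codebook)
approx = dequantize(indices, codebook)
```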
|
|
|
* Better Accuracy at 1-2 bits (405B @ <2-bit, 70B @ 2-bit)
* Lightweight Quantization Algorithm: only ~17 hours to quantize Llama-3.1 405B
* Agile Quantization Inference: low decode overhead, high throughput, and low time to first token (TTFT); see the inference sketch after this list
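For reference, a minimal inference sketch following the quickstart pattern in the VPTQ repository. The `vptq.AutoModelForCausalLM` API and the community checkpoint name below are assumptions based on that repo and should be checked against its current README:

```python
import transformers
import vptq  # pip install vptq

# Community-released VPTQ checkpoint (name illustrative; see the repo for the list)
model_id = "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = vptq.AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain vector quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # assumes a CUDA GPU
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```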
|
|
|
[Github/Codes](https://github.com/microsoft/VPTQ) |
|
|
|
[Online Demo](https://huggingface.co/spaces/microsoft/VPTQ) |
|
|
|
[Paper](https://arxiv.org/abs/2409.17066) |
|