---
license: apache-2.0
---

## Overview

This model can be run with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability 8.0 or higher (Ampere and newer: A100, A10, RTX 3090, etc.) as a weight-only W8A16 model, using the FP8 Marlin kernel.
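Below is a minimal offline-inference sketch using vLLM's Python API; the model ID is a placeholder for this repository, and the prompt and sampling settings are only illustrative:

```python
from vllm import LLM, SamplingParams

# Placeholder: replace with this repository's actual Hugging Face model ID.
MODEL_ID = "path/to/this-fp8-checkpoint"

# vLLM reads the quantization config from the checkpoint; on Ampere-class GPUs
# (compute capability 8.x) the FP8 weights are served through the weight-only
# W8A16 Marlin kernel described below.
llm = LLM(model=MODEL_ID, max_model_len=4096)

sampling = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Write a Python function that returns the n-th Fibonacci number."],
    sampling,
)
print(outputs[0].outputs[0].text)
```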
The Marlin kernel achieves impressive efficiency by packing four 8-bit values into an int32 and performing 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations. This approach enables nearly a **2x speedup** over FP16 on most models while maintaining almost **completely lossless quality**!
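As a rough illustration of that layout (not the actual CUDA kernel), the sketch below packs four 8-bit values into one 32-bit word and unpacks them again with shifts and masks; the function names and the simplified FP8 E4M3 decoding are assumptions made for illustration only:

```python
def pack4_fp8(bytes4):
    """Pack four FP8 (one-byte) values into a single 32-bit word, low byte first."""
    assert len(bytes4) == 4
    word = 0
    for i, b in enumerate(bytes4):
        word |= (b & 0xFF) << (8 * i)
    return word

def unpack4_fp8(word):
    """Recover the four FP8 bytes from the packed 32-bit word with shifts and masks."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def e4m3_to_float(b):
    """Decode one FP8 E4M3 byte to a Python float (normal and subnormal values only;
    the NaN encoding is ignored for brevity)."""
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 0:                      # subnormal: sign * (man/8) * 2^-6
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Example: pack four FP8 values, then dequantize them back.
raw = [0x38, 0x40, 0xCA, 0x00]        # 1.0, 2.0, -5.0, 0.0 in E4M3
word = pack4_fp8(raw)
print([e4m3_to_float(b) for b in unpack4_fp8(word)])
```

In the real kernel this unpack-and-dequantize step happens per thread with a handful of bit operations, which is what keeps the weight-only format's overhead so low.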
### FP8 Marlin Details

- Introduced by [Michael Goin and the Neural Magic team](https://github.com/vllm-project/vllm/pull/5975), FP8 Marlin leverages NVIDIA's GPU architecture to deliver a compact, high-performance weight format.
- FP8 achieves near-lossless compression, making it ideal for models where quantization error in traditional formats like int4 or int8 may degrade output quality.

### Why FP8?

I uploaded this FP8-quantized model to experiment with high-precision code handling. Traditional int4 quantization of models like `Qwen/Qwen2.5-Coder-32B-Instruct-int4` sometimes produced poor outputs, with repeated tokens caused by quantization error. The FP8 format, by contrast, **does not require calibration data** and provides robust, near-lossless compression.
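As a hedged illustration of that calibration-free property, vLLM can also quantize an FP16/BF16 checkpoint to FP8 on the fly with no calibration dataset; the snippet below assumes the upstream `Qwen/Qwen2.5-Coder-32B-Instruct` base model:

```python
from vllm import LLM, SamplingParams

# On-the-fly FP8 quantization of the original BF16 checkpoint: no calibration
# data is involved, the weights are simply cast to FP8 at load time.
llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct", quantization="fp8")

out = llm.generate(["def quicksort(arr):"], SamplingParams(temperature=0.0, max_tokens=128))
print(out[0].outputs[0].text)
```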
As demonstrated in Neural Magic’s recent paper ([arXiv:2411.02355](https://arxiv.org/pdf/2411.02355)), int4 has difficulty recovering the fidelity of the FP16 baseline unless it is calibrated carefully. FP8, especially in the W8A16 format used here, maintains high-quality outputs without extensive calibration, making it a reliable and performant solution for precision-sensitive applications like code generation.