Commit cbc10d6 by Vezora (parent: 3a74439)

Update README.md

Files changed (1): README.md (+19 -3)
---
license: apache-2.0
---

## Overview

This model can be run with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs of compute capability 8.0 or higher (Ampere and newer: A100, A10, RTX 3090, etc.) as a weight-only W8A16 model, utilizing FP8 Marlin.

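A minimal usage sketch, assuming vLLM is installed on a supported GPU; the model id below is a placeholder for this repository's Hugging Face id, and the prompt is just an example:

```python
from vllm import LLM, SamplingParams

# Placeholder model id -- substitute this repository's Hugging Face id.
llm = LLM(model="<this-repo-id>", quantization="fp8", dtype="bfloat16")

params = SamplingParams(max_tokens=64, temperature=0.0)
outputs = llm.generate(["def fibonacci(n):"], params)
print(outputs[0].outputs[0].text)
```

With `quantization="fp8"`, vLLM keeps the stored 8-bit weights and dequantizes to BF16 on the fly via the Marlin kernel (the W8A16 scheme described above).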
The Marlin kernel achieves its efficiency by packing four 8-bit values into an int32 and performing 4×FP8 → 4×FP16/BF16 dequantization using bit arithmetic and SIMT operations. This approach enables nearly a **2x speedup** over FP16 on most models while remaining **almost completely lossless**!

### FP8 Marlin Details

- Introduced by [Michael Goin and the Neural Magic team](https://github.com/vllm-project/vllm/pull/5975), FP8 Marlin leverages NVIDIA's GPU architecture to deliver a compact, high-performance format.
- FP8 achieves nearly lossless compression, making it ideal for models where quantization errors in traditional formats like int4 or int8 may degrade performance.

### Why FP8?

I uploaded this FP8-quantized model to experiment with high-precision code handling. Traditional int4 quantization of models like `Qwen/Qwen2.5-Coder-32B-Instruct-int4` sometimes produced poor outputs, with repeated tokens caused by quantization error. The FP8 format, by contrast, **requires no calibration data** and provides robust, near-lossless compression.

As demonstrated in Neural Magic's recent paper ([arXiv:2411.02355](https://arxiv.org/pdf/2411.02355)), int4 struggles to recover FP16-level fidelity unless it is calibrated carefully. FP8, especially in the W8A16 form used here, maintains high-quality outputs without extensive calibration, making it a reliable, performant choice for precision-sensitive applications like code generation.
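A toy numeric illustration of this gap (my own sketch, not the paper's methodology): round-trip a Gaussian weight vector through symmetric int4 and through a simplified E4M3 rounding model (subnormals treated as normals), and compare reconstruction error:

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor int4: 16 uniform levels in [-8, 7].
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale

def quantize_fp8_e4m3(w):
    # Simplified E4M3 rounding: scale into the fp8 range (448 = max finite
    # E4M3 value), keep an implicit leading bit plus 3 mantissa bits.
    scale = np.abs(w).max() / 448.0
    m, e = np.frexp(w / scale)        # x = m * 2**e with 0.5 <= |m| < 1
    m_q = np.round(m * 16) / 16       # round mantissa to 4 significant bits
    return np.ldexp(m_q, e) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 4096).astype(np.float64)

err_int4 = np.abs(w - quantize_int4(w)).mean()
err_fp8 = np.abs(w - quantize_fp8_e4m3(w)).mean()
print(f"int4 mean abs error: {err_int4:.2e}")
print(f"fp8  mean abs error: {err_fp8:.2e}")
```

Because int4 spends its 16 levels uniformly across the weight range while FP8 keeps a fixed number of relative-precision bits at every magnitude, the FP8 round trip lands much closer to the original weights.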