---
license: apache-2.0
---

## Overview

This model can be run with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability 8.0 or higher (Ampere and newer: A100, A10, RTX 3090, etc.) as a weight-only W8A16 model, using the FP8 Marlin kernel.
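Below is a minimal offline-inference sketch using vLLM's Python API; the model ID is a placeholder for this repository, and the prompt and sampling settings are only illustrative:

```python
from vllm import LLM, SamplingParams

# Placeholder: replace with this repository's actual Hugging Face model ID.
MODEL_ID = "path/to/this-fp8-checkpoint"

# vLLM reads the quantization config from the checkpoint; on Ampere-class GPUs
# (compute capability 8.x) the FP8 weights are served through the weight-only
# W8A16 Marlin kernel described below.
llm = LLM(model=MODEL_ID, max_model_len=4096)

sampling = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Write a Python function that returns the n-th Fibonacci number."],
    sampling,
)
print(outputs[0].outputs[0].text)
```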
The Marlin kernel achieves impressive efficiency by packing four 8-bit values into an int32 and performing 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations. This approach enables nearly a **2x speedup** over FP16 on most models while maintaining almost **completely lossless quality**!
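As a rough illustration of that layout (not the actual CUDA kernel), the sketch below packs four 8-bit values into one 32-bit word and unpacks them again with shifts and masks; the function names and the simplified FP8 E4M3 decoding are assumptions made for illustration only:

```python
def pack4_fp8(bytes4):
    """Pack four FP8 (one-byte) values into a single 32-bit word, low byte first."""
    assert len(bytes4) == 4
    word = 0
    for i, b in enumerate(bytes4):
        word |= (b & 0xFF) << (8 * i)
    return word

def unpack4_fp8(word):
    """Recover the four FP8 bytes from the packed 32-bit word with shifts and masks."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def e4m3_to_float(b):
    """Decode one FP8 E4M3 byte to a Python float (normal and subnormal values only;
    the NaN encoding is ignored for brevity)."""
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 0:                      # subnormal: sign * (man/8) * 2^-6
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Example: pack four FP8 values, then dequantize them back.
raw = [0x38, 0x40, 0xCA, 0x00]        # 1.0, 2.0, -5.0, 0.0 in E4M3
word = pack4_fp8(raw)
print([e4m3_to_float(b) for b in unpack4_fp8(word)])
```

In the real kernel this unpack-and-dequantize step happens per thread with a handful of bit operations, which is what keeps the weight-only format's overhead so low.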
### FP8 Marlin Details

- Introduced by [Michael Goin and the Neural Magic team](https://github.com/vllm-project/vllm/pull/5975), FP8 Marlin leverages NVIDIA's GPU architecture to deliver a compact, high-performance weight format.
- FP8 achieves near-lossless compression, making it ideal for models where quantization error in traditional formats like int4 or int8 may degrade output quality.

### Why FP8?

I uploaded this FP8-quantized model to experiment with high-precision code handling. Traditional int4 quantization of models like `Qwen/Qwen2.5-Coder-32B-Instruct-int4` sometimes produced poor outputs, with repeated tokens caused by quantization error. The FP8 format, by contrast, **does not require calibration data** and provides robust, near-lossless compression.
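As a hedged illustration of that calibration-free property, vLLM can also quantize an FP16/BF16 checkpoint to FP8 on the fly with no calibration dataset; the snippet below assumes the upstream `Qwen/Qwen2.5-Coder-32B-Instruct` base model:

```python
from vllm import LLM, SamplingParams

# On-the-fly FP8 quantization of the original BF16 checkpoint: no calibration
# data is involved, the weights are simply cast to FP8 at load time.
llm = LLM(model="Qwen/Qwen2.5-Coder-32B-Instruct", quantization="fp8")

out = llm.generate(["def quicksort(arr):"], SamplingParams(temperature=0.0, max_tokens=128))
print(out[0].outputs[0].text)
```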
As demonstrated in Neural Magic’s recent paper ([arXiv:2411.02355](https://arxiv.org/pdf/2411.02355)), int4 has difficulty recovering the fidelity of the FP16 baseline unless it is calibrated carefully. FP8, especially in the W8A16 format used here, maintains high-quality outputs without extensive calibration, making it a reliable and performant solution for precision-sensitive applications like code generation.