Efficient Inference Kernel Support for 1.58-bit Models
by LeiWang1999
Check out this repo, guys!
https://github.com/microsoft/BitBLAS/tree/main/integration/BitNet
## BitBLAS Results

### Performance
Note: to reproduce the BitBLAS results, please check out `benchmark_inference_latency.py`. To reproduce the results of the original model, please check out the 1bitLLM/bitnet_b1_58-3B repo.
| Model | Device | batchsize | in_seq | Base model | bitnet-1.58b-3b-huggingface (ms) | bitnet-1.58b-3b-bitblas (ms) |
|---|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | LLAMA-3B | 177.6729107 | 64.17962909 |
| bitnet_b1_58-3B | A100 | 128 | 1 | LLAMA-3B | 188.6145592 | 63.48158518 |
| bitnet_b1_58-3B | A100 | 1 | 2048 | LLAMA-3B | 348.7066031 | 202.6877999 |
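For anyone curious how numbers like these are typically collected: below is a minimal sketch of timing a forward pass on GPU with CUDA events, assuming a CUDA-capable PyTorch setup. The helper name `time_forward_ms` is just an illustration; `benchmark_inference_latency.py` in the repo is the authoritative script.

```python
import torch

def time_forward_ms(model, input_ids, warmup=10, iters=100):
    """Return mean forward latency in milliseconds using CUDA events.

    Assumes model and input_ids already live on the same CUDA device.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        # Warm up so one-time costs (kernel compilation, caches) are excluded.
        for _ in range(warmup):
            model(input_ids)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(input_ids)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time reports ms
```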
### On-the-Fly GPU Memory Footprint
We measured the GPU memory footprint with the `nvidia-smi` command. Please check out `nvidia_measure_memory.sh` to capture real-time GPU memory usage, then start the `benchmark_model_10k_loops.py` workload to measure the overall GPU memory usage.
| Model | Device | batchsize | in_seq | bitnet-1.58b-3b-huggingface | bitnet-1.58b-3b-bitblas |
|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | 7595 MB | 1729 MB |
| bitnet_b1_58-3B | A100 | 128 | 1 | 7677 MB | 1789 MB |
| bitnet_b1_58-3B | A100 | 1 | 2048 | 8731 MB | 3163 MB |
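For reference, here is a minimal sketch of sampling peak memory from Python with the standard `nvidia-smi --query-gpu` flags; `nvidia_measure_memory.sh` in the repo plays the same role as a shell script, and the helper name `poll_gpu_memory_mb` is hypothetical.

```python
import subprocess
import time

def poll_gpu_memory_mb(interval_s=0.5, duration_s=30):
    """Sample memory.used (MiB) via nvidia-smi and return the peak seen."""
    peak = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        # One integer per GPU, e.g. "7595"; noheader/nounits keeps parsing trivial.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True)
        peak = max(peak, max(int(v) for v in out.split()))
        time.sleep(interval_s)
    return peak
```

Run this in one process while the benchmark workload runs in another, and the returned peak corresponds to the footprint numbers in the table above.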
Just simply replace the inference kernel of the BitnetLinear module with the BitBLAS one to get these gains; a sketch of the module-swap pattern follows.
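Below is a minimal sketch of that module-swap pattern in plain PyTorch. The `BitblasLinear` class here is a hypothetical stand-in that only emulates BitNet b1.58's ternary math with a dense floating-point matmul; the real BitBLAS-backed layer (packed 2-bit weights, custom CUDA kernels) lives in the repo's integration/BitNet code.

```python
import torch
import torch.nn as nn

class BitblasLinear(nn.Module):
    """Hypothetical stand-in for a BitBLAS-backed ternary linear layer."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Quantize weights to {-1, 0, +1} with a per-tensor scale,
        # mirroring BitNet b1.58's ternary weight format.
        w = linear.weight.data
        scale = w.abs().mean()
        self.register_buffer("weight_ternary", torch.round(w / scale).clamp(-1, 1))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # A real kernel would consume packed 2-bit weights directly; here we
        # emulate the arithmetic with a dense fp matmul for clarity.
        y = x @ (self.weight_ternary * self.scale).t()
        return y if self.bias is None else y + self.bias

def swap_linear_modules(model: nn.Module):
    """Recursively replace linear submodules with BitblasLinear.

    In the actual BitNet model the swap target would be BitnetLinear
    rather than nn.Linear.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, BitblasLinear(child))
        else:
            swap_linear_modules(child)
```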