Efficient Inference Kernel Support for 1.58-bit Models
by LeiWang1999
Check out this repo, guys!
https://github.com/microsoft/BitBLAS/tree/main/integration/BitNet
## BitBLAS Results

### Performance
Note: to reproduce the BitBLAS results, please check out `benchmark_inference_latency.py`. To reproduce the results of the original model, please check out the 1bitLLM/bitnet_b1_58-3B repo.
| Model | Device | batchsize | in_seq | Base model | bitnet-1.58b-3b-huggingface (ms) | bitnet-1.58b-3b-bitblas (ms) |
|---|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | LLAMA-3B | 177.6729107 | 64.17962909 |
| bitnet_b1_58-3B | A100 | 128 | 1 | LLAMA-3B | 188.6145592 | 63.48158518 |
| bitnet_b1_58-3B | A100 | 1 | 2048 | LLAMA-3B | 348.7066031 | 202.6877999 |
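For anyone curious how numbers like these are typically collected: below is a minimal sketch of timing a forward pass on GPU with CUDA events, assuming a CUDA-capable PyTorch setup. The helper name `time_forward_ms` is just an illustration; `benchmark_inference_latency.py` in the repo is the authoritative script.

```python
import torch

def time_forward_ms(model, input_ids, warmup=10, iters=100):
    """Return mean forward latency in milliseconds using CUDA events.

    Assumes model and input_ids already live on the same CUDA device.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        # Warm up so one-time costs (kernel compilation, caches) are excluded.
        for _ in range(warmup):
            model(input_ids)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(input_ids)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time reports ms
```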
### On-the-Fly GPU Memory Footprint
We measured the GPU memory footprint with the `nvidia-smi` command. Please check out `nvidia_measure_memory.sh` to capture real-time GPU memory usage, then start the `benchmark_model_10k_loops.py` workload to measure the overall GPU memory usage.
| Model | Device | batchsize | in_seq | bitnet-1.58b-3b-huggingface | bitnet-1.58b-3b-bitblas |
|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | 7595 MB | 1729 MB |
| bitnet_b1_58-3B | A100 | 128 | 1 | 7677 MB | 1789 MB |
| bitnet_b1_58-3B | A100 | 1 | 2048 | 8731 MB | 3163 MB |
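For reference, here is a minimal sketch of sampling peak memory from Python with the standard `nvidia-smi --query-gpu` flags; `nvidia_measure_memory.sh` in the repo plays the same role as a shell script, and the helper name `poll_gpu_memory_mb` is hypothetical.

```python
import subprocess
import time

def poll_gpu_memory_mb(interval_s=0.5, duration_s=30):
    """Sample memory.used (MiB) via nvidia-smi and return the peak seen."""
    peak = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        # One integer per GPU, e.g. "7595"; noheader/nounits keeps parsing trivial.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True)
        peak = max(peak, max(int(v) for v in out.split()))
        time.sleep(interval_s)
    return peak
```

Run this in one process while the benchmark workload runs in another, and the returned peak corresponds to the footprint numbers in the table above.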
Just simply replace the inference kernel of the BitnetLinear module with the BitBLAS one to get these gains; a sketch of the module-swap pattern follows.
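Below is a minimal sketch of that module-swap pattern in plain PyTorch. The `BitblasLinear` class here is a hypothetical stand-in that only emulates BitNet b1.58's ternary math with a dense floating-point matmul; the real BitBLAS-backed layer (packed 2-bit weights, custom CUDA kernels) lives in the repo's integration/BitNet code.

```python
import torch
import torch.nn as nn

class BitblasLinear(nn.Module):
    """Hypothetical stand-in for a BitBLAS-backed ternary linear layer."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Quantize weights to {-1, 0, +1} with a per-tensor scale,
        # mirroring BitNet b1.58's ternary weight format.
        w = linear.weight.data
        scale = w.abs().mean()
        self.register_buffer("weight_ternary", torch.round(w / scale).clamp(-1, 1))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # A real kernel would consume packed 2-bit weights directly; here we
        # emulate the arithmetic with a dense fp matmul for clarity.
        y = x @ (self.weight_ternary * self.scale).t()
        return y if self.bias is None else y + self.bias

def swap_linear_modules(model: nn.Module):
    """Recursively replace linear submodules with BitblasLinear.

    In the actual BitNet model the swap target would be BitnetLinear
    rather than nn.Linear.
    """
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, BitblasLinear(child))
        else:
            swap_linear_modules(child)
```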