Hardware requirement
Does anyone know how much VRAM I need to run this model? Thanks.
The model weights themselves take around 9 GB of VRAM. Depending on the serving framework you use and your context length (prompt + answer), reserve another 1-2 GB to be safe. So at a minimum you should serve this on a 12 GB VRAM Nvidia card (an RTX 3060, T4, etc.).
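For reference, a minimal sketch of loading the quant with transformers (the repo id below is an assumption, so swap in the id from this model card; GPTQ loading also needs optimum and auto-gptq installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id -- replace with the id shown on this model card.
model_id = "astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place the quantized weights on the GPU
    torch_dtype=torch.float16,  # activations in fp16
)

prompt = "Explain GPTQ quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```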
If your GPU has less VRAM, consider our 4-bit GPTQ quant instead: https://huggingface.co/astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit. It should fit in under 8 GB of VRAM.
Both quants have been tested with transformers, the Hugging Face pipeline, and vLLM. We are running additional testing on Hugging Face's text-generation-inference and oobabooga's text-generation-webui; the performance metrics and sample code used will be posted shortly.
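Until then, here is a rough vLLM sketch; the repo id, context length, and memory fraction are assumptions you should tune to your card:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit",  # assumed repo id
    quantization="gptq",
    max_model_len=4096,           # shorter context -> smaller KV cache
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What hardware do I need to run Llama 3 8B?"], params)
print(outputs[0].outputs[0].text)
```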
Thank you so much. ♥