Hardware requirement
Does anyone know how much VRAM I need to run this model? Thanks.
The model weights themselves take around 9 GB of VRAM. Depending on the serving framework you use and your context length (prompt + answer), reserve another 1-2 GB to be safe. So at a minimum you should serve this on a 12 GB VRAM Nvidia card (an RTX 3060, T4, etc.).
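For reference, a minimal sketch of loading the quant with transformers (the repo id below is an assumption, so swap in the id from this model card; GPTQ loading also needs optimum and auto-gptq installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id -- replace with the id shown on this model card.
model_id = "astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place the quantized weights on the GPU
    torch_dtype=torch.float16,  # activations in fp16
)

prompt = "Explain GPTQ quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```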
If your GPU has less VRAM, consider our 4-bit GPTQ quant instead: https://huggingface.co/astronomer-io/Llama-3-8B-Instruct-GPTQ-4-Bit. It should fit in under 8 GB of VRAM.
Both quants have been tested with transformers, the Hugging Face pipeline, and vLLM. We are running additional testing on Hugging Face's text-generation-inference and oobabooga's text-generation-webui; the performance metrics and sample code used will be posted shortly.
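Until then, here is a rough vLLM sketch; the repo id, context length, and memory fraction are assumptions you should tune to your card:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="astronomer-io/Llama-3-8B-Instruct-GPTQ-8-Bit",  # assumed repo id
    quantization="gptq",
    max_model_len=4096,           # shorter context -> smaller KV cache
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What hardware do I need to run Llama 3 8B?"], params)
print(outputs[0].outputs[0].text)
```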
Thank you so much. ♥