Not able to use it with TGI

#5
by Alokgupta96 - opened

export model=/data/LLama31-FP8/
export volume=/mnt/LLM_Compressor

docker run \
  --gpus '"device=2"' \
  --shm-size 1g \
  -p 8085:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $model \
  --max-top-n-tokens 1 \
  --max-total-tokens 4096 \
  --max-input-length 2048 \
  --max-best-of 1 \
  --cuda-memory-fraction 0.9 \
  --trust-remote-code \
  --max-batch-prefill-tokens 2048

I am using the above bash script to launch the server.
The Docker container does start, but the model returns garbage text.

Neural Magic org

Hi @Alokgupta96, I recommend using this model with vLLM (https://github.com/vllm-project/vllm). At Neural Magic we have added support for this model to vLLM, along with CUTLASS-based w8a8 FP8 kernels for further optimization and an FP8 Marlin kernel for use on Ampere GPUs.
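For reference, a minimal vLLM invocation for this checkpoint might look like the sketch below. The model path and port are carried over from the docker command earlier in the thread; the flags are vLLM's standard serving options, so treat this as an untested starting point rather than a verified recipe:

```shell
# Serve the FP8 checkpoint with vLLM's OpenAI-compatible server
# (/data/LLama31-FP8/ is the local checkpoint path from the post above)
vllm serve /data/LLama31-FP8/ \
  --port 8085 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9
```

vLLM reads the quantization config stored in the checkpoint, so no explicit quantization flag should be needed on hardware with FP8 support.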
