Running an Inference Server

#6 by Vista4334

I'm struggling to start a server capable of running this model, simply because of its size; I'm inexperienced with very large models. How would I approach this on a machine with 8x A100 GPUs (80 GB each)? This is just for inference, not training.

GOAT.AI org

Hi, even 2 A100s are more than enough to host the model. Have you looked into TGI? It has good documentation and is the easiest route: it boils down to pulling their Docker image and writing a correct launch command (don't forget to set bfloat16 precision): https://huggingface.co/docs/text-generation-inference/en/index
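
For illustration, here is a minimal sketch of such a launch, driven from Python. The model id, port, and cache path are placeholders (the thread doesn't give them); the TGI flags `--num-shard` and `--dtype` are the real launcher options for tensor parallelism and precision:

```python
# Minimal sketch of a TGI launch via Docker, assuming Docker and NVIDIA
# drivers are installed. model_id and paths are hypothetical placeholders.
import subprocess

model_id = "org/model-name"  # placeholder: the actual model repo id

cmd = [
    "docker", "run", "--gpus", "all",
    "--shm-size", "1g",            # TGI needs a larger shared-memory segment
    "-p", "8080:80",               # expose TGI's port 80 as localhost:8080
    "-v", "/data/hf-cache:/data",  # cache the weights outside the container
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", model_id,
    "--num-shard", "2",            # shard across the 2 A100s
    "--dtype", "bfloat16",         # the precision called out above
]
subprocess.run(cmd, check=True)
```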

I created an inference endpoint in AWS using 2x A100s (the recommended configuration). After setting the endpoint URL and using the sample Python code to run generate_story(), I get the following error on the first API request:

Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: captures_underway == 0 INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1699449201336/work/c10/cuda/CUDACachingAllocator.cpp":2939, please report a bug to PyTorch.
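
The thread's generate_story() sample isn't shown, so for anyone trying to reproduce this, here is a hypothetical sketch of what such a client call against a TGI-backed endpoint typically looks like; the endpoint URL and token are placeholders, and the `/generate` route with `inputs`/`parameters` is TGI's standard request shape:

```python
# Hypothetical sketch of the client side; generate_story() from the thread
# is not shown, so the URL, token, and parameters here are placeholders.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder; needed for protected endpoints

def generate_story(prompt: str) -> str:
    resp = requests.post(
        f"{ENDPOINT_URL}/generate",
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": 512, "temperature": 0.8},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

print(generate_story("Once upon a time"))
```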
