Running an Inference Server

#6 by Vista4334

I'm struggling to start a server capable of running this model, simply because of its size; I'm inexperienced with very large models. How would I approach this on a machine with 8x A100 GPUs (80 GB each)? This is just for inference, not training.

GOAT.AI org

Hi, even 2 A100s are more than enough to host the model. Have you looked into TGI? It has good documentation and is the easiest route: it boils down to pulling their Docker image and writing a correct launch command (don't forget to set bfloat16 precision): https://huggingface.co/docs/text-generation-inference/en/index
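
For illustration, here is a minimal sketch of such a launch, driven from Python. The model id, port, and cache path are placeholders (the thread doesn't give them); the TGI flags `--num-shard` and `--dtype` are the real launcher options for tensor parallelism and precision:

```python
# Minimal sketch of a TGI launch via Docker, assuming Docker and NVIDIA
# drivers are installed. model_id and paths are hypothetical placeholders.
import subprocess

model_id = "org/model-name"  # placeholder: the actual model repo id

cmd = [
    "docker", "run", "--gpus", "all",
    "--shm-size", "1g",            # TGI needs a larger shared-memory segment
    "-p", "8080:80",               # expose TGI's port 80 as localhost:8080
    "-v", "/data/hf-cache:/data",  # cache the weights outside the container
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", model_id,
    "--num-shard", "2",            # shard across the 2 A100s
    "--dtype", "bfloat16",         # the precision called out above
]
subprocess.run(cmd, check=True)
```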

I created an inference endpoint in AWS using 2x A100s (the recommended configuration). After setting the endpoint URL and using the sample Python code to run generate_story(), I get the following error on the first API request:

Request failed during generation: Server error: Unexpected <class 'RuntimeError'>: captures_underway == 0 INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1699449201336/work/c10/cuda/CUDACachingAllocator.cpp":2939, please report a bug to PyTorch.
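
The thread's generate_story() sample isn't shown, so for anyone trying to reproduce this, here is a hypothetical sketch of what such a client call against a TGI-backed endpoint typically looks like; the endpoint URL and token are placeholders, and the `/generate` route with `inputs`/`parameters` is TGI's standard request shape:

```python
# Hypothetical sketch of the client side; generate_story() from the thread
# is not shown, so the URL, token, and parameters here are placeholders.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder; needed for protected endpoints

def generate_story(prompt: str) -> str:
    resp = requests.post(
        f"{ENDPOINT_URL}/generate",
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": 512, "temperature": 0.8},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

print(generate_story("Once upon a time"))
```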
