Inference Endpoint Setup
Awesome work!
May I ask which setup you chose to run the model? I tried running it via Docker on a server with 2 x A100 80GB and 142GB RAM, but all I got was an infinite loop of "Waiting for shard 0/1 to be ready...".
A hint would be really cool!
Hi @JulianGerhard! We're currently running this on 2 x A100 (80GB), so it does indeed seem like there's an issue with Inference Endpoints. I've alerted the team internally - thanks!
Hey @JulianGerhard, I chatted with @philschmid and he showed me that one can deploy the 40B model on Inference Endpoints with 1 x A100 (80GB) by enabling quantization with the text-generation-inference container:
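For readers who can't see the original screenshot, here is a minimal sketch of what that looks like when running the text-generation-inference container directly with Docker; the model ID (tiiuae/falcon-40b-instruct), the volume path, and the port mapping are assumptions to adapt to your own setup:

```shell
# Sketch: run text-generation-inference with quantization enabled so the
# 40B model fits on a single A100 80GB (model ID and paths are placeholders)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id tiiuae/falcon-40b-instruct \
  --num-shard 1 \
  --quantize bitsandbytes
```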
If you're having trouble with text-generation-inference itself, I recommend opening an issue there.
Hi @lewtun,
first of all, thanks a lot for this detailed answer! Philipp told me that you have a limited number of A100s, and since my current use case is purely out of my own interest, I don't want to occupy valuable resources in the meantime.
I started experimenting with the library itself and managed to start my own inference endpoint. It may be worth noting for future readers that even on a capable system like mine, loading the shards with quantization takes about 30 minutes.
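Once the shards have finished loading, a quick sanity check against the running server (assuming the container's port 80 is mapped to 8080 on the host, as in the sketch above) could look like this:

```shell
# Hypothetical sanity check: send a generation request to the local endpoint
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is quantization?", "parameters": {"max_new_tokens": 50}}'
```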
Kind regards and thanks again
Julian