Inference Endpoint Setup
Awesome work!
May I ask which setup you chose to run the model? I tried running it via Docker on a server with 2 x A100 80GB and 142GB RAM, but all I got was an infinite loop of "Waiting for shard 0/1 to be ready...".
A hint would be really cool!
Hi @JulianGerhard! We're currently running this on 2 x A100 (80GB), so it does indeed seem like there's an issue with Inference Endpoints. I've alerted the team internally - thanks!
Hey @JulianGerhard, I chatted with @philschmid and he showed me that one can deploy the 40B model on Inference Endpoints with 1 x A100 (80GB) by enabling quantization with the text-generation-inference container:
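For readers who can't see the original screenshot, here is a minimal sketch of what that looks like when running the text-generation-inference container directly with Docker; the model ID (tiiuae/falcon-40b-instruct), the volume path, and the port mapping are assumptions to adapt to your own setup:

```shell
# Sketch: run text-generation-inference with quantization enabled so the
# 40B model fits on a single A100 80GB (model ID and paths are placeholders)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id tiiuae/falcon-40b-instruct \
  --num-shard 1 \
  --quantize bitsandbytes
```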
If you're having trouble with text-generation-inference itself, I recommend opening an issue there.
Hi @lewtun,
first of all, thanks a lot for this detailed answer! Philipp told me that you have a limited number of A100s, and since my current use case is purely out of my own interest, I don't want to occupy valuable resources in the meantime.
I started experimenting with the library itself and managed to start my own inference endpoint. It may be worth noting for future readers that even on a capable system like mine, loading the shards with quantization takes about 30 minutes.
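Once the shards have finished loading, a quick sanity check against the running server (assuming the container's port 80 is mapped to 8080 on the host, as in the sketch above) could look like this:

```shell
# Hypothetical sanity check: send a generation request to the local endpoint
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is quantization?", "parameters": {"max_new_tokens": 50}}'
```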
Kind regards and thanks again
Julian