Investigate better queue

#1
by adamelliotfields - opened

Right now we have a single generate function that takes a model name. Under the hood, this function is assigned a concurrency_id, which Gradio's queue uses to form concurrency groups. It also results in a simple REST API with only one /generate endpoint.
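
Roughly, the current wiring looks like this (a minimal sketch, assuming Gradio 4's concurrency_id/api_name parameters; function and component names are illustrative):

```python
import gradio as gr

def generate(model_name: str, prompt: str) -> str:
    # ... swap in the pipeline for `model_name` and run inference ...
    return f"image generated by {model_name}"

with gr.Blocks() as demo:
    model = gr.Dropdown(["flux", "sdxl"], label="Model")
    prompt = gr.Textbox(label="Prompt")
    output = gr.Textbox(label="Output")
    btn = gr.Button("Generate")
    # One event listener -> one concurrency group and one /generate endpoint.
    btn.click(generate, [model, prompt], output,
              concurrency_id="gpu", api_name="generate")

demo.queue(default_concurrency_limit=1).launch()
```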

It might make more sense to tie each model to its own button so multiple concurrency groups can be formed. This would also create multiple API endpoints, and API requests for a given model would be grouped accordingly.
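
Something like this sketch (again assuming Gradio 4; model and endpoint names are illustrative):

```python
import gradio as gr
from functools import partial

MODELS = ["flux", "sdxl"]  # illustrative names

def generate(model_name: str, prompt: str) -> str:
    # ... run inference with the pipeline for `model_name` ...
    return f"image generated by {model_name}"

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="Prompt")
    output = gr.Textbox(label="Output")
    for name in MODELS:
        btn = gr.Button(f"Generate ({name})")
        # Each listener gets its own concurrency_id, so the queue forms
        # one group per model; api_name yields one endpoint per model.
        btn.click(partial(generate, name), prompt, output,
                  concurrency_id=name, api_name=f"generate_{name}")

demo.queue().launch()
```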

For the UI, try toggling the visibility of the other buttons so it always appears as if there is a single button. Keep the current model dropdown, and toggle which button is visible when the dropdown value changes.
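
One way to wire that up (a sketch using gr.update to toggle visibility; component names are made up):

```python
import gradio as gr

MODELS = ["flux", "sdxl"]

with gr.Blocks() as demo:
    model = gr.Dropdown(MODELS, value=MODELS[0], label="Model")
    buttons = [gr.Button(f"Generate ({m})", visible=(i == 0))
               for i, m in enumerate(MODELS)]

    def toggle(selected):
        # One update per button; only the button matching the dropdown
        # selection stays visible, so the UI looks like a single button.
        return [gr.update(visible=(m == selected)) for m in MODELS]

    model.change(toggle, model, buttons)

demo.launch()
```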

Fal's hot model routing is pretty much what I was envisioning. They use an HTTP header to provide a routing hint so the request goes to a worker that already has the model loaded.
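
From the client side the idea looks something like this (purely illustrative; the header name and URL below are made up, not Fal's actual API):

```python
import requests

# The hint tells the router to prefer a worker with this model warm.
resp = requests.post(
    "https://example.com/generate",
    headers={"X-Model-Hint": "flux"},  # hypothetical header
    json={"prompt": "a photo of a cat"},
)
```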

This was a neat idea, but in reality performance on ZeroGPU is never going to be great because of the latency of waiting for a GPU to be allocated. This sort of optimization would only make sense if we were actually developing our own Stable Diffusion API service.

Edit: I actually don't think the Gradio queue makes a difference with ZeroGPU. You can only fit a single Flux pipeline on a 40GB GPU, yet the Flux space can generate 30+ images simultaneously with the default concurrency limit of 1. So I think the scaling happens at the infra level, like Kubernetes pod autoscaling. This means we don't need the model swapping logic when deployed to ZeroGPU.
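
For reference, on ZeroGPU each decorated call is scheduled onto a pooled GPU by the infra, which would explain concurrent requests running regardless of the queue settings (a sketch assuming the spaces package available on a ZeroGPU Space):

```python
import gradio as gr
import spaces

@spaces.GPU
def generate(prompt: str) -> str:
    # Runs on whichever GPU ZeroGPU allocates for this call, so there is
    # no need to swap models in and out of a single long-lived device.
    return f"image for {prompt}"

gr.Interface(generate, gr.Textbox(), gr.Textbox()).queue().launch()
```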

adamelliotfields changed discussion status to closed
