Investigate better queue
Right now we have a single `generate` function that takes a model name. Under the hood, this function is assigned a `concurrency_id`, which Gradio's queue uses to form concurrency groups. It also results in a simple REST API with only one `/generate` endpoint.
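For reference, a minimal sketch of this setup; the model names and the `generate` body are placeholders, not the Space's actual code:

```python
import gradio as gr

def generate(model_name: str, prompt: str):
    ...  # model loading/swapping and inference elided

with gr.Blocks() as demo:
    model = gr.Dropdown(["sdxl_turbo", "flux_dev"], label="Model")  # placeholder names
    prompt = gr.Textbox(label="Prompt")
    output = gr.Image()
    btn = gr.Button("Generate")
    # A single concurrency_id means every request lands in one queue group,
    # and the auto-generated API exposes only one /generate endpoint.
    btn.click(generate, [model, prompt], output,
              concurrency_id="generate", api_name="generate")

demo.queue().launch()
```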
It might make more sense to tie each model to its own button so that multiple concurrency groups can be formed. This would also create multiple API endpoints, and API requests for a given model would be grouped accordingly.
For the UI, try toggling the visibility of the other buttons so it always appears as if there is a single button: keep the current model dropdown, and toggle each button's visibility on the dropdown's change event.
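A minimal sketch of that layout, assuming one button per model whose visibility follows the dropdown (model names, endpoint names, and the `generate` body are again placeholders):

```python
import gradio as gr

MODELS = ["sdxl_turbo", "flux_dev"]  # placeholder names

def generate(model_name: str, prompt: str):
    ...  # pipeline loading and inference elided

with gr.Blocks() as demo:
    model = gr.Dropdown(MODELS, value=MODELS[0], label="Model")
    prompt = gr.Textbox(label="Prompt")
    output = gr.Image()

    buttons = []
    for name in MODELS:
        # Each button gets its own concurrency group and its own
        # /generate_<name> endpoint in the auto-generated API.
        btn = gr.Button("Generate", visible=(name == MODELS[0]))
        btn.click(lambda p, m=name: generate(m, p), prompt, output,
                  concurrency_id=name, api_name=f"generate_{name}")
        buttons.append(btn)

    # Show only the selected model's button so the UI still appears
    # to have a single Generate button.
    model.change(lambda selected: [gr.update(visible=(n == selected)) for n in MODELS],
                 model, buttons)

demo.queue().launch()
```

With this layout, requests for different models sit in separate queue groups, so one model's backlog doesn't block another's.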
Fal's hot model routing is pretty much what I was envisioning. They use an HTTP header to provide a routing hint so the request goes to a worker that already has the model loaded.
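On the client side that looks something like the following; the endpoint URL and header name here are made up for illustration and are not Fal's actual API:

```python
import requests

# Hypothetical routing hint: the header tells the load balancer which
# worker pool (with the model already loaded) should serve the request.
resp = requests.post(
    "https://api.example.com/generate",       # placeholder URL
    headers={"X-Routing-Hint": "flux_dev"},   # made-up header name
    json={"prompt": "an astronaut riding a horse"},
)
resp.raise_for_status()
```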
This was a neat idea, but in reality performance on ZeroGPU is never going to be great because of the latency of waiting for a GPU. This sort of optimization would only make sense if we were actually building our own Stable Diffusion API service.
Edit: I actually don't think the Gradio queue makes a difference with ZeroGPU. You can only fit a single Flux pipeline on a 40GB GPU, yet the Flux space can generate 30+ images simultaneously with the default concurrency limit of 1. So I think this is more of an infra-level solution, like Kubernetes pod autoscaling. This means we don't need the model-swapping logic when deployed to ZeroGPU.