MixDQ Model Card

Model Description

MixDQ is a mixed-precision quantization method that reduces the memory and computational cost of text-to-image diffusion models while preserving generation quality. It supports few-step diffusion models (e.g., SDXL-turbo, LCM-LoRA), which makes it possible to build diffusion models that are both fast and small. An efficient CUDA kernel implementation is provided so that the savings are realized in practice.
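To make the term "quantization" concrete, the sketch below shows generic symmetric uniform quantization of a few floating-point values to a chosen bit-width. It is an illustration only and is not the MixDQ algorithm; the function names `quantize_symmetric` and `dequantize` and the sample weights are hypothetical.

```python
# Generic symmetric uniform quantization (illustration only, NOT MixDQ itself).
# All names and sample values here are hypothetical.

def quantize_symmetric(values, n_bits):
    """Map floats to signed integer codes in [-q_max, q_max]."""
    q_max = 2 ** (n_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


weights = [0.50, -1.27, 0.03, 0.90]
codes, scale = quantize_symmetric(weights, n_bits=8)
restored = dequantize(codes, scale)

# The rounding error of each value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Storing the integer codes plus one scale instead of 16-bit floats is where the memory saving comes from; lower bit-widths save more but increase the rounding error.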

Model Sources

For more information, please refer to:

Project Page: https://a-suozhang.xyz/mixdq.github.io/
arXiv paper: https://arxiv.org/abs/2405.17873
GitHub Repository: https://github.com/A-suozhang/MixDQ

Evaluation

We evaluate MixDQ with several metrics: FID (fidelity), CLIPScore (image-text alignment), and ImageReward (human preference). MixDQ achieves W8A8 quantization (8-bit weights, 8-bit activations) with almost no loss on these metrics, and the differences between images generated by MixDQ and those generated by the FP16 model are negligible.

Method	FID (↓)	CLIPScore (↑)	ImageReward (↑)
FP16	17.15	0.2722	0.8631
MixDQ-W8A8	17.03	0.2703	0.8415
MixDQ-W5A8	17.23	0.2697	0.8307
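As a quick way to read the table, the following sketch computes each quantized configuration's change relative to the FP16 baseline, in percent. The numbers are copied from the table above; the helper `relative_change_percent` is a hypothetical name introduced for illustration.

```python
# Relative change of each metric versus the FP16 baseline, in percent.
# Values are copied from the evaluation table; the helper name is hypothetical.
results = {
    "FP16":       {"FID": 17.15, "CLIPScore": 0.2722, "ImageReward": 0.8631},
    "MixDQ-W8A8": {"FID": 17.03, "CLIPScore": 0.2703, "ImageReward": 0.8415},
    "MixDQ-W5A8": {"FID": 17.23, "CLIPScore": 0.2697, "ImageReward": 0.8307},
}


def relative_change_percent(value, baseline):
    return 100.0 * (value - baseline) / baseline


baseline = results["FP16"]
changes = {
    method: {
        metric: round(relative_change_percent(value, baseline[metric]), 2)
        for metric, value in row.items()
    }
    for method, row in results.items()
    if method != "FP16"
}

for method, row in changes.items():
    print(method, row)
```

Lower is better for FID, so a negative change there is an improvement, while for CLIPScore and ImageReward a negative change is a drop.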

Usage

Install the prerequisites for MixDQ:

  # The Python versions required to run mixdq: 3.8, 3.9, 3.10
  pip install -i https://pypi.org/simple/ mixdq-extension
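Because the extension targets specific Python versions, it can help to check the interpreter before installing. The snippet below is a minimal sketch that assumes only the version list stated above (3.8, 3.9, 3.10); the names `SUPPORTED` and `is_supported` are hypothetical and are not part of mixdq-extension.

```python
import sys

# Versions listed in the install note above; names here are hypothetical.
SUPPORTED = {(3, 8), (3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is in the supported set."""
    return tuple(version_info[:2]) in SUPPORTED


if not is_supported():
    print(
        "Python %d.%d is not in the supported list (3.8, 3.9, 3.10)."
        % tuple(sys.version_info[:2])
    )
```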

Run the pipeline:

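  import torch
  from diffusers import DiffusionPipeline

  # Load SDXL-turbo in FP16 with the MixDQ custom pipeline
  pipe = DiffusionPipeline.from_pretrained(
      "stabilityai/sdxl-turbo", custom_pipeline="nics-efc/MixDQ",
      torch_dtype=torch.float16, variant="fp16"
  )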

  # Quantize the UNet
  pipe.quantize_unet(
      w_bit=8,  # weight bit-width
      a_bit=8,  # activation bit-width
      bos=True,
  )

  # set_cuda_graph is optional and is used for acceleration
  pipe.set_cuda_graph(
      run_pipeline=True,
  )

  # Test the memory usage and latency of the pipeline or the UNet
  pipe.run_for_test(
      device="cuda",
      output_type="pil",
      run_pipeline=True,
      path="pipeline_test.png",
      profile=True
  )
  # After execution finishes, a report in JSON format is written under the log/sdxl folder.
  # The report can be opened with TensorBoard to examine the profiling results:
  #   tensorboard --logdir=./log

  # run the pipeline
  pipe = pipe.to("cuda")
  prompt = "A black Honda motorcycle parked in front of a garage."
  image = pipe(prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
  image.save('mixdq_pipeline.png')

Performance tested on NVIDIA 4080:

UNet Latency (ms)	No CUDA Graph	With CUDA Graph
FP16 version	44.6	36.1
Quantized version	59.1	24.9
Speedup (FP16 / quantized)	0.75×	1.45×
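The speedup row can be reproduced from the two latency rows as FP16 latency divided by quantized latency. The sketch below uses only the numbers in the table; the variable names are chosen for illustration.

```python
# Speedup = FP16 latency / quantized latency, using the table values (ms).
latency_ms = {
    "No CUDA Graph":   {"fp16": 44.6, "quantized": 59.1},
    "With CUDA Graph": {"fp16": 36.1, "quantized": 24.9},
}

speedup = {
    setting: round(t["fp16"] / t["quantized"], 2)
    for setting, t in latency_ms.items()
}

for setting, value in speedup.items():
    print(f"{setting}: {value:.2f}x")
```

A value below 1 means the quantized UNet is slower than FP16 in that setting, so in these measurements the quantized version is faster only when CUDA Graph is enabled.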
