{ "cells": [ { "cell_type": "markdown", "id": "06f95d7d-988c-4d52-b9d9-10ac6345c58c", "metadata": {}, "source": [ "# Image Editing with InstructPix2Pix and OpenVINO\n", "\n", "The InstructPix2Pix is a conditional diffusion model that edits images based on written instructions provided by the user.\n", "Generative image editing models traditionally target a single editing task like style transfer or translation between image domains. Text guidance gives us an opportunity to solve multiple tasks with a single model.\n", "The InstructPix2Pix method works different than existing text-based image editing in that it enables editing from instructions that tell the model what action to perform instead of using text labels, captions or descriptions of input/output images. A key benefit of following editing instructions is that the user can just tell the model exactly what to do in natural written text. There is no need for the user to provide extra information, such as example images or descriptions of visual content that remain constant between the input and output images. More details about this approach can be found in this [paper](https://arxiv.org/pdf/2211.09800.pdf) and [repository](https://github.com/timothybrooks/instruct-pix2pix).\n", "\n", "This notebook demonstrates how to convert and run the InstructPix2Pix model using OpenVINO.\n", "\n", "Notebook contains the following steps:\n", "\n", "1. Convert PyTorch models to OpenVINO IR format, using Model Conversion API.\n", "2. Run InstructPix2Pix pipeline with OpenVINO.\n", "3. Optimize InstructPix2Pix pipeline with [NNCF](https://github.com/openvinotoolkit/nncf/) quantization.\n", "4. Compare results of original and optimized pipelines." ] }, { "cell_type": "markdown", "id": "0defb0be", "metadata": {}, "source": [ "\n", "#### Table of contents:\n", "\n", "- [Prerequisites](#Prerequisites)\n", "- [Create Pytorch Models pipeline](#Create-Pytorch-Models-pipeline)\n", "- [Convert Models to OpenVINO IR](#Convert-Models-to-OpenVINO-IR)\n", " - [Text Encoder](#Text-Encoder)\n", " - [VAE](#VAE)\n", " - [Unet](#Unet)\n", "- [Prepare Inference Pipeline](#Prepare-Inference-Pipeline)\n", "- [Quantization](#Quantization)\n", " - [Prepare calibration dataset](#Prepare-calibration-dataset)\n", " - [Run quantization](#Run-quantization)\n", " - [Compare inference time of the FP16 and INT8 models](#Compare-inference-time-of-the-FP16-and-INT8-models)\n", "- [Interactive demo with Gradio](#Interactive-demo-with-Gradio)\n", "\n" ] }, { "cell_type": "markdown", "id": "55d351b8-895e-4a14-893b-d4e765eda077", "metadata": {}, "source": [ "## Prerequisites\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Install necessary packages" ] }, { "cell_type": "code", "execution_count": null, "id": "42f23a38-22b8-420f-8f67-5c6b58e6c947", "metadata": {}, "outputs": [], "source": [ "import platform\n", "\n", "%pip install -q \"transformers>=4.25.1\" torch accelerate \"gradio>4.19\" \"datasets>=2.14.6\" diffusers pillow opencv-python --extra-index-url https://download.pytorch.org/whl/cpu\n", "%pip install -q \"openvino>=2023.1.0\"\n", "\n", "if platform.system() != \"Windows\":\n", " %pip install -q \"matplotlib>=3.4\"\n", "else:\n", " %pip install -q \"matplotlib>=3.4,<3.7\"" ] }, { "cell_type": "markdown", "id": "a546ab9e-ee97-4d0d-85bb-a1286f74cef5", "metadata": {}, "source": [ "## Create Pytorch Models pipeline\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "`StableDiffusionInstructPix2PixPipeline` is an end-to-end inference pipeline that you can use to edit 
images from text instructions with just a few lines of code, provided as part of the 🤗🧨 [diffusers](https://huggingface.co/docs/diffusers/index) library.\n", "\n", "First, we load the pre-trained weights of all components of the model.\n", "\n", "> **NOTE**: Initially, model loading can take some time due to downloading the weights. Also, the download speed depends on your internet connection." ] }, { "cell_type": "code", "execution_count": null, "id": "45aacf70-fb30-4099-b2c9-e081df6e8d38", "metadata": {}, "outputs": [], "source": [ "import torch\n", "from diffusers import (\n", "    StableDiffusionInstructPix2PixPipeline,\n", "    EulerAncestralDiscreteScheduler,\n", ")\n", "\n", "model_id = \"timbrooks/instruct-pix2pix\"\n", "pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float32, safety_checker=None)\n", "scheduler_config = pipe.scheduler.config\n", "text_encoder = pipe.text_encoder\n", "text_encoder.eval()\n", "unet = pipe.unet\n", "unet.eval()\n", "vae = pipe.vae\n", "vae.eval()\n", "\n", "del pipe" ] }, { "cell_type": "markdown", "id": "6d8d5bdc-7ced-411d-b385-c4b1331e8888", "metadata": {}, "source": [ "## Convert Models to OpenVINO IR\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "OpenVINO supports PyTorch models using [Model Conversion API](https://docs.openvino.ai/2024/openvino-workflow/model-preparation.html) to convert the model to IR format. The `ov.convert_model` function accepts a PyTorch model object and example input and converts them to an `ov.Model` class instance that is ready for loading on a device or can be saved on disk using `ov.save_model`.\n", "\n", "The InstructPix2Pix model is based on Stable Diffusion, a large-scale text-to-image latent diffusion model. You can find more details about how to run Stable Diffusion for text-to-image generation with OpenVINO in a separate [tutorial](../stable-diffusion-text-to-image/stable-diffusion-text-to-image.ipynb).\n", "\n", "\n", "The model consists of three important parts:\n", "\n", "* Text Encoder - to create conditions from a text prompt.\n", "* Unet - for step-by-step denoising of the latent image representation.\n", "* Autoencoder (VAE) - to encode the initial image to latent space to start the denoising process and to decode the latent space back to an image when denoising is complete.\n", "\n", "Let us convert each part." ] }, { "cell_type": "markdown", "id": "8e26d41c-1b5e-4ed3-a75b-2fc0f7486efa", "metadata": {}, "source": [ "### Text Encoder\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The text-encoder is responsible for transforming the input prompt, for example, \"a photo of an astronaut riding a horse\" into an embedding space that can be understood by the UNet. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.\n", "\n", "The input of the text encoder is the tensor `input_ids`, which contains indexes of tokens from the text processed by the tokenizer and padded to the maximum length accepted by the model. Model outputs are two tensors: `last_hidden_state` - the hidden state from the last MultiHeadAttention layer in the model, and `pooler_out` - the pooled output for the whole model hidden states."
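, "\n", "\n", "As a quick illustration of these inputs and outputs, the snippet below shows roughly how the converted encoder can be called once the conversion cell has produced `text_encoder.xml`. It is a minimal sketch for reference only (the tokenizer checkpoint is the same one loaded later in this notebook) and is not part of the pipeline code:\n", "\n", "```python\n", "import openvino as ov\n", "from transformers import CLIPTokenizer\n", "\n", "core = ov.Core()\n", "text_enc = core.compile_model(\"text_encoder.xml\", \"AUTO\")\n", "tokenizer = CLIPTokenizer.from_pretrained(\"openai/clip-vit-large-patch14\")\n", "\n", "tokens = tokenizer(\n", "    \"a photo of an astronaut riding a horse\",\n", "    padding=\"max_length\",\n", "    max_length=tokenizer.model_max_length,  # 77 for CLIP\n", "    truncation=True,\n", "    return_tensors=\"np\",\n", ")\n", "last_hidden_state = text_enc(tokens.input_ids)[text_enc.output(0)]\n", "print(last_hidden_state.shape)  # (1, 77, 768) - one 768-dimensional embedding per token\n", "```"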
] }, { "cell_type": "code", "execution_count": 2, "id": "98d13b52-75da-43f1-aa53-deebcac34e3c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text encoder will be loaded from text_encoder.xml\n" ] }, { "data": { "text/plain": [ "32" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pathlib import Path\n", "import openvino as ov\n", "import gc\n", "\n", "core = ov.Core()\n", "\n", "TEXT_ENCODER_OV_PATH = Path(\"text_encoder.xml\")\n", "\n", "\n", "def cleanup_torchscript_cache():\n", " \"\"\"\n", " Helper for removing cached model representation\n", " \"\"\"\n", " torch._C._jit_clear_class_registry()\n", " torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()\n", " torch.jit._state._clear_class_state()\n", "\n", "\n", "def convert_encoder(text_encoder: torch.nn.Module, ir_path: Path):\n", " \"\"\"\n", " Convert Text Encoder mode.\n", " Function accepts text encoder model, and prepares example inputs for conversion,\n", " Parameters:\n", " text_encoder (torch.nn.Module): text_encoder model from Stable Diffusion pipeline\n", " ir_path (Path): File for storing model\n", " Returns:\n", " None\n", " \"\"\"\n", " input_ids = torch.ones((1, 77), dtype=torch.long)\n", " # switch model to inference mode\n", " text_encoder.eval()\n", "\n", " # disable gradients calculation for reducing memory consumption\n", " with torch.no_grad():\n", " # Export model to IR format\n", " ov_model = ov.convert_model(\n", " text_encoder,\n", " example_input=input_ids,\n", " input=[\n", " (1, 77),\n", " ],\n", " )\n", " ov.save_model(ov_model, ir_path)\n", " del ov_model\n", " cleanup_torchscript_cache()\n", " print(f\"Text Encoder successfully converted to IR and saved to {ir_path}\")\n", "\n", "\n", "if not TEXT_ENCODER_OV_PATH.exists():\n", " convert_encoder(text_encoder, TEXT_ENCODER_OV_PATH)\n", "else:\n", " print(f\"Text encoder will be loaded from {TEXT_ENCODER_OV_PATH}\")\n", "\n", "del text_encoder\n", "gc.collect()" ] }, { "cell_type": "markdown", "id": "b325ddef-d1ee-422f-8e35-c7c289c96322", "metadata": {}, "source": [ "### VAE\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The VAE model consists of two parts: an encoder and a decoder.\n", "\n", "* The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the UNet model.\n", "* The decoder, conversely, transforms the latent representation back into an image.\n", "\n", "In comparison with a text-to-image inference pipeline, where VAE is used only for decoding, the pipeline also involves the original image encoding. As the two parts are used separately in the pipeline on different steps, and do not depend on each other, we should convert them into two independent models." 
] }, { "cell_type": "code", "execution_count": 3, "id": "15b75491-2c27-4d21-bcee-554a9fd66675", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VAE encoder will be loaded from vae_encoder.xml\n", "VAE decoder will be loaded from vae_decoder.xml\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "VAE_ENCODER_OV_PATH = Path(\"vae_encoder.xml\")\n", "\n", "\n", "def convert_vae_encoder(vae: torch.nn.Module, ir_path: Path):\n", " \"\"\"\n", " Convert VAE model for encoding to IR format.\n", " Function accepts vae model, creates wrapper class for export only necessary for inference part,\n", " prepares example inputs for conversion,\n", " Parameters:\n", " vae (torch.nn.Module): VAE model from StableDiffusio pipeline\n", " ir_path (Path): File for storing model\n", " Returns:\n", " None\n", " \"\"\"\n", "\n", " class VAEEncoderWrapper(torch.nn.Module):\n", " def __init__(self, vae):\n", " super().__init__()\n", " self.vae = vae\n", "\n", " def forward(self, image):\n", " return self.vae.encode(x=image)[\"latent_dist\"].sample()\n", "\n", " vae_encoder = VAEEncoderWrapper(vae)\n", " vae_encoder.eval()\n", " image = torch.zeros((1, 3, 512, 512))\n", " with torch.no_grad():\n", " ov_model = ov.convert_model(vae_encoder, example_input=image, input=[((1, 3, 512, 512),)])\n", " ov.save_model(ov_model, ir_path)\n", " del ov_model\n", " cleanup_torchscript_cache()\n", " print(f\"VAE encoder successfully converted to IR and saved to {ir_path}\")\n", "\n", "\n", "if not VAE_ENCODER_OV_PATH.exists():\n", " convert_vae_encoder(vae, VAE_ENCODER_OV_PATH)\n", "else:\n", " print(f\"VAE encoder will be loaded from {VAE_ENCODER_OV_PATH}\")\n", "\n", "VAE_DECODER_OV_PATH = Path(\"vae_decoder.xml\")\n", "\n", "\n", "def convert_vae_decoder(vae: torch.nn.Module, ir_path: Path):\n", " \"\"\"\n", " Convert VAE model for decoding to IR format.\n", " Function accepts vae model, creates wrapper class for export only necessary for inference part,\n", " prepares example inputs for conversion,\n", " Parameters:\n", " vae (torch.nn.Module): VAE model frm StableDiffusion pipeline\n", " ir_path (Path): File for storing model\n", " Returns:\n", " None\n", " \"\"\"\n", "\n", " class VAEDecoderWrapper(torch.nn.Module):\n", " def __init__(self, vae):\n", " super().__init__()\n", " self.vae = vae\n", "\n", " def forward(self, latents):\n", " return self.vae.decode(latents)\n", "\n", " vae_decoder = VAEDecoderWrapper(vae)\n", " latents = torch.zeros((1, 4, 64, 64))\n", "\n", " vae_decoder.eval()\n", " with torch.no_grad():\n", " ov_model = ov.convert_model(vae_decoder, example_input=latents, input=[((1, 4, 64, 64),)])\n", " ov.save_model(ov_model, ir_path)\n", " del ov_model\n", " cleanup_torchscript_cache()\n", " print(f\"VAE decoder successfully converted to IR and saved to {ir_path}\")\n", "\n", "\n", "if not VAE_DECODER_OV_PATH.exists():\n", " convert_vae_decoder(vae, VAE_DECODER_OV_PATH)\n", "else:\n", " print(f\"VAE decoder will be loaded from {VAE_DECODER_OV_PATH}\")\n", "\n", "del vae\n", "gc.collect()" ] }, { "cell_type": "markdown", "id": "fd9b3a94-5948-42a7-ac17-c569c333c0af", "metadata": {}, "source": [ "### Unet\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The Unet model has three inputs:\n", "\n", "* `scaled_latent_model_input` - the latent image sample from previous step. 
Generation process has not been started yet, so you will use random noise.\n", "* `timestep` - a current scheduler step.\n", "* `text_embeddings` - a hidden state of the text encoder.\n", "\n", "Model predicts the `sample` state for the next step." ] }, { "cell_type": "code", "execution_count": 4, "id": "4615b5a7-2ff0-4300-a08b-8ba3965f0bf0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unet will be loaded from unet.xml\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "UNET_OV_PATH = Path(\"unet.xml\")\n", "\n", "dtype_mapping = {torch.float32: ov.Type.f32, torch.float64: ov.Type.f64}\n", "\n", "\n", "def convert_unet(unet: torch.nn.Module, ir_path: Path):\n", " \"\"\"\n", " Convert U-net model to IR format.\n", " Function accepts unet model, prepares example inputs for conversion,\n", " Parameters:\n", " unet (StableDiffusionPipeline): unet from Stable Diffusion pipeline\n", " ir_path (Path): File for storing model\n", " Returns:\n", " None\n", " \"\"\"\n", " # prepare inputs\n", " encoder_hidden_state = torch.ones((3, 77, 768))\n", " latents_shape = (3, 8, 512 // 8, 512 // 8)\n", " latents = torch.randn(latents_shape)\n", " t = torch.from_numpy(np.array(1, dtype=float))\n", " dummy_inputs = (latents, t, encoder_hidden_state)\n", " input_info = []\n", " for input_tensor in dummy_inputs:\n", " shape = ov.PartialShape(tuple(input_tensor.shape))\n", " element_type = dtype_mapping[input_tensor.dtype]\n", " input_info.append((shape, element_type))\n", "\n", " unet.eval()\n", " with torch.no_grad():\n", " ov_model = ov.convert_model(unet, example_input=dummy_inputs, input=input_info)\n", " ov.save_model(ov_model, ir_path)\n", " del ov_model\n", " cleanup_torchscript_cache()\n", " print(f\"Unet successfully converted to IR and saved to {ir_path}\")\n", "\n", "\n", "if not UNET_OV_PATH.exists():\n", " convert_unet(unet, UNET_OV_PATH)\n", " gc.collect()\n", "else:\n", " print(f\"Unet will be loaded from {UNET_OV_PATH}\")\n", "del unet\n", "gc.collect()" ] }, { "cell_type": "markdown", "id": "3f31307c-8ddd-46f4-b9a6-3f2709bb2a93", "metadata": {}, "source": [ "## Prepare Inference Pipeline\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Putting it all together, let us now take a closer look at how the model inference works by illustrating the logical flow.\n", "\n", "![diagram](https://user-images.githubusercontent.com/29454499/214895365-3063ac11-0486-4d9b-9e25-8f469aba5e5d.png)\n", "\n", "The InstructPix2Pix model takes both an image and a text prompt as an input. The image is transformed to latent image representations of size $64 \\times 64$, using the encoder part of variational autoencoder, whereas the text prompt is transformed to text embeddings of size $77 \\times 768$ via CLIP's text encoder.\n", "\n", "Next, the UNet model iteratively *denoises* the random latent image representations while being conditioned on the text embeddings. 
The output of the UNet, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm.\n", "\n", "The *denoising* process is repeated a given number of times (by default 100) to retrieve progressively better latent image representations.\n", "Once denoising is complete, the latent image representation is decoded by the decoder part of the variational autoencoder.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "500b8c59-ad25-40b3-b2ae-8418c0170d14", "metadata": {}, "outputs": [], "source": [ "from diffusers import DiffusionPipeline\n", "from transformers import CLIPTokenizer\n", "from typing import Union, List, Optional, Tuple\n", "import PIL\n", "import cv2\n", "\n", "\n", "def scale_fit_to_window(dst_width: int, dst_height: int, image_width: int, image_height: int):\n", "    \"\"\"\n", "    Preprocessing helper function for calculating the image size for resizing while preserving the original aspect ratio\n", "    and fitting the image to a specific window size\n", "\n", "    Parameters:\n", "      dst_width (int): destination window width\n", "      dst_height (int): destination window height\n", "      image_width (int): source image width\n", "      image_height (int): source image height\n", "    Returns:\n", "      result_width (int): calculated width for resize\n", "      result_height (int): calculated height for resize\n", "    \"\"\"\n", "    im_scale = min(dst_height / image_height, dst_width / image_width)\n", "    return int(im_scale * image_width), int(im_scale * image_height)\n", "\n", "\n", "def preprocess(image: PIL.Image.Image):\n", "    \"\"\"\n", "    Image preprocessing function. Takes image in PIL.Image format, resizes it to keep the aspect ratio and fit the model input window 512x512,\n", "    then converts it to np.ndarray and adds zero padding on the right or bottom side of the image (depending on the aspect ratio), after that\n", "    converts data to the float32 data type, changes the range of values from [0, 255] to [-1, 1] and, finally, converts the data layout from NHWC to planar NCHW.\n", "    The function returns preprocessed input tensor and padding size, which can be used in postprocessing.\n", "\n", "    Parameters:\n", "      image (PIL.Image.Image): input image\n", "    Returns:\n", "      image (np.ndarray): preprocessed image tensor\n", "      pad (Tuple[int]): padding size for each dimension for restoring image size in postprocessing\n", "    \"\"\"\n", "    src_width, src_height = image.size\n", "    dst_width, dst_height = scale_fit_to_window(512, 512, src_width, src_height)\n", "    image = np.array(image.resize((dst_width, dst_height), resample=PIL.Image.Resampling.LANCZOS))[None, :]\n", "    pad_width = 512 - dst_width\n", "    pad_height = 512 - dst_height\n", "    pad = ((0, 0), (0, pad_height), (0, pad_width), (0, 0))\n", "    image = np.pad(image, pad, mode=\"constant\")\n", "    image = image.astype(np.float32) / 255.0\n", "    image = 2.0 * image - 1.0\n", "    image = image.transpose(0, 3, 1, 2)\n", "    return image, pad\n", "\n", "\n", "def randn_tensor(\n", "    shape: Union[Tuple, List],\n", "    dtype: Optional[np.dtype] = np.float32,\n", "):\n", "    \"\"\"\n", "    Helper function for generating a random values tensor with a given shape and data type\n", "\n", "    Parameters:\n", "      shape (Union[Tuple, List]): shape for filling random values\n", "      dtype (np.dtype, *optional*, np.float32): data type for result\n", "    Returns:\n", "      latents (np.ndarray): tensor with random values with given data type and shape (usually represents noise in latent space)\n", "    \"\"\"\n", "    latents = np.random.randn(*shape).astype(dtype)\n", "\n", "    return latents\n", "\n", "\n", 
"class OVInstructPix2PixPipeline(DiffusionPipeline):\n", " \"\"\"\n", " OpenVINO inference pipeline for InstructPix2Pix\n", " \"\"\"\n", "\n", " def __init__(\n", " self,\n", " tokenizer: CLIPTokenizer,\n", " scheduler: EulerAncestralDiscreteScheduler,\n", " core: ov.Core,\n", " text_encoder: ov.Model,\n", " vae_encoder: ov.Model,\n", " unet: ov.Model,\n", " vae_decoder: ov.Model,\n", " device: str = \"AUTO\",\n", " ):\n", " super().__init__()\n", " self.tokenizer = tokenizer\n", " self.vae_scale_factor = 8\n", " self.scheduler = scheduler\n", " self.load_models(core, device, text_encoder, vae_encoder, unet, vae_decoder)\n", "\n", " def load_models(\n", " self,\n", " core: ov.Core,\n", " device: str,\n", " text_encoder: ov.Model,\n", " vae_encoder: ov.Model,\n", " unet: ov.Model,\n", " vae_decoder: ov.Model,\n", " ):\n", " \"\"\"\n", " Function for loading models on device using OpenVINO\n", "\n", " Parameters:\n", " core (Core): OpenVINO runtime Core class instance\n", " device (str): inference device\n", " text_encoder (Model): OpenVINO Model object represents text encoder\n", " vae_encoder (Model): OpenVINO Model object represents vae encoder\n", " unet (Model): OpenVINO Model object represents unet\n", " vae_decoder (Model): OpenVINO Model object represents vae decoder\n", " Returns\n", " None\n", " \"\"\"\n", " self.text_encoder = core.compile_model(text_encoder, device)\n", " self.text_encoder_out = self.text_encoder.output(0)\n", " ov_config = {\"INFERENCE_PRECISION_HINT\": \"f32\"} if device != \"CPU\" else {}\n", " self.vae_encoder = core.compile_model(vae_encoder, device, ov_config)\n", " self.vae_encoder_out = self.vae_encoder.output(0)\n", " # We have to register UNet in config to be able to change it externally to collect calibration data\n", " self.register_to_config(unet=core.compile_model(unet, device))\n", " self.unet_out = self.unet.output(0)\n", " self.vae_decoder = core.compile_model(vae_decoder, device, ov_config)\n", " self.vae_decoder_out = self.vae_decoder.output(0)\n", "\n", " def __call__(\n", " self,\n", " prompt: Union[str, List[str]],\n", " image: PIL.Image.Image,\n", " num_inference_steps: int = 10,\n", " guidance_scale: float = 7.5,\n", " image_guidance_scale: float = 1.5,\n", " eta: float = 0.0,\n", " latents: Optional[np.array] = None,\n", " output_type: Optional[str] = \"pil\",\n", " ):\n", " \"\"\"\n", " Function invoked when calling the pipeline for generation.\n", "\n", " Parameters:\n", " prompt (`str` or `List[str]`):\n", " The prompt or prompts to guide the image generation.\n", " image (`PIL.Image.Image`):\n", " `Image`, or tensor representing an image batch which will be repainted according to `prompt`.\n", " num_inference_steps (`int`, *optional*, defaults to 100):\n", " The number of denoising steps. More denoising steps usually lead to a higher quality image at the\n", " expense of slower inference.\n", " guidance_scale (`float`, *optional*, defaults to 7.5):\n", " Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).\n", " `guidance_scale` is defined as `w` of equation 2. of [Imagen\n", " Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >\n", " 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,\n", " usually at the expense of lower image quality. 
This pipeline requires a value of at least `1`.\n", "            image_guidance_scale (`float`, *optional*, defaults to 1.5):\n", "                Image guidance scale is used to push the generated image towards the initial image `image`. Image guidance\n", "                scale is enabled by setting `image_guidance_scale > 1`. Higher image guidance scale encourages the model to\n", "                generate images that are closely linked to the source image `image`, usually at the expense of lower\n", "                image quality. This pipeline requires a value of at least `1`.\n", "            latents (`np.ndarray`, *optional*):\n", "                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image\n", "                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents\n", "                tensor will be generated by random sampling.\n", "            output_type (`str`, *optional*, defaults to `\"pil\"`):\n", "                The output format of the generated image. Choose between\n", "                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.\n", "        Returns:\n", "            image (List[Union[np.ndarray, PIL.Image.Image]]): generated images\n", "\n", "        \"\"\"\n", "\n", "        # 1. Define call parameters\n", "        batch_size = 1 if isinstance(prompt, str) else len(prompt)\n", "        # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)\n", "        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`\n", "        # corresponds to doing no classifier free guidance.\n", "        do_classifier_free_guidance = guidance_scale > 1.0 and image_guidance_scale >= 1.0\n", "        # check if scheduler is in sigmas space\n", "        scheduler_is_in_sigma_space = hasattr(self.scheduler, \"sigmas\")\n", "\n", "        # 2. Encode input prompt\n", "        text_embeddings = self._encode_prompt(prompt)\n", "\n", "        # 3. Preprocess image\n", "        orig_width, orig_height = image.size\n", "        image, pad = preprocess(image)\n", "        height, width = image.shape[-2:]\n", "\n", "        # 4. set timesteps\n", "        self.scheduler.set_timesteps(num_inference_steps)\n", "        timesteps = self.scheduler.timesteps\n", "\n", "        # 5. Prepare Image latents\n", "        image_latents = self.prepare_image_latents(\n", "            image,\n", "            do_classifier_free_guidance=do_classifier_free_guidance,\n", "        )\n", "\n", "        # 6. Prepare latent variables\n", "        num_channels_latents = 4\n", "        latents = self.prepare_latents(\n", "            batch_size,\n", "            num_channels_latents,\n", "            height,\n", "            width,\n", "            text_embeddings.dtype,\n", "            latents,\n", "        )\n", "\n", "        # 7. 
Denoising loop\n", " num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order\n", " with self.progress_bar(total=num_inference_steps) as progress_bar:\n", " for i, t in enumerate(timesteps):\n", " # Expand the latents if we are doing classifier free guidance.\n", " # The latents are expanded 3 times because for pix2pix the guidance\\\n", " # is applied for both the text and the input image.\n", " latent_model_input = np.concatenate([latents] * 3) if do_classifier_free_guidance else latents\n", "\n", " # concat latents, image_latents in the channel dimension\n", " scaled_latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)\n", " scaled_latent_model_input = np.concatenate([scaled_latent_model_input, image_latents], axis=1)\n", "\n", " # predict the noise residual\n", " noise_pred = self.unet([scaled_latent_model_input, t, text_embeddings])[self.unet_out]\n", "\n", " # Hack:\n", " # For karras style schedulers the model does classifier free guidance using the\n", " # predicted_original_sample instead of the noise_pred. So we need to compute the\n", " # predicted_original_sample here if we are using a karras style scheduler.\n", " if scheduler_is_in_sigma_space:\n", " step_index = (self.scheduler.timesteps == t).nonzero().item()\n", " sigma = self.scheduler.sigmas[step_index].numpy()\n", " noise_pred = latent_model_input - sigma * noise_pred\n", "\n", " # perform guidance\n", " if do_classifier_free_guidance:\n", " noise_pred_text, noise_pred_image, noise_pred_uncond = (\n", " noise_pred[0],\n", " noise_pred[1],\n", " noise_pred[2],\n", " )\n", " noise_pred = (\n", " noise_pred_uncond\n", " + guidance_scale * (noise_pred_text - noise_pred_image)\n", " + image_guidance_scale * (noise_pred_image - noise_pred_uncond)\n", " )\n", "\n", " # For karras style schedulers the model does classifier free guidance using the\n", " # predicted_original_sample instead of the noise_pred. But the scheduler.step function\n", " # expects the noise_pred and computes the predicted_original_sample internally. So we\n", " # need to overwrite the noise_pred here such that the value of the computed\n", " # predicted_original_sample is correct.\n", " if scheduler_is_in_sigma_space:\n", " noise_pred = (noise_pred - latents) / (-sigma)\n", "\n", " # compute the previous noisy sample x_t -> x_t-1\n", " latents = self.scheduler.step(torch.from_numpy(noise_pred), t, torch.from_numpy(latents)).prev_sample.numpy()\n", "\n", " # call the callback, if provided\n", " if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):\n", " progress_bar.update()\n", "\n", " # 8. Post-processing\n", " image = self.decode_latents(latents, pad)\n", "\n", " # 9. 
Convert to PIL\n", "        if output_type == \"pil\":\n", "            image = self.numpy_to_pil(image)\n", "            image = [img.resize((orig_width, orig_height), PIL.Image.Resampling.LANCZOS) for img in image]\n", "        else:\n", "            image = [cv2.resize(img, (orig_width, orig_height)) for img in image]\n", "\n", "        return image\n", "\n", "    def _encode_prompt(\n", "        self,\n", "        prompt: Union[str, List[str]],\n", "        num_images_per_prompt: int = 1,\n", "        do_classifier_free_guidance: bool = True,\n", "    ):\n", "        \"\"\"\n", "        Encodes the prompt into text encoder hidden states.\n", "\n", "        Parameters:\n", "            prompt (str or list(str)): prompt to be encoded\n", "            num_images_per_prompt (int): number of images that should be generated per prompt\n", "            do_classifier_free_guidance (bool): whether to use classifier free guidance or not\n", "        Returns:\n", "            text_embeddings (np.ndarray): text encoder hidden states\n", "        \"\"\"\n", "        batch_size = len(prompt) if isinstance(prompt, list) else 1\n", "\n", "        # tokenize input prompts\n", "        text_inputs = self.tokenizer(\n", "            prompt,\n", "            padding=\"max_length\",\n", "            max_length=self.tokenizer.model_max_length,\n", "            truncation=True,\n", "            return_tensors=\"np\",\n", "        )\n", "        text_input_ids = text_inputs.input_ids\n", "\n", "        text_embeddings = self.text_encoder(text_input_ids)[self.text_encoder_out]\n", "\n", "        # duplicate text embeddings for each generation per prompt, using mps friendly method\n", "        if num_images_per_prompt != 1:\n", "            bs_embed, seq_len, _ = text_embeddings.shape\n", "            text_embeddings = np.tile(text_embeddings, (1, num_images_per_prompt, 1))\n", "            text_embeddings = np.reshape(text_embeddings, (bs_embed * num_images_per_prompt, seq_len, -1))\n", "\n", "        # get unconditional embeddings for classifier free guidance\n", "        if do_classifier_free_guidance:\n", "            uncond_tokens: List[str]\n", "            uncond_tokens = [\"\"] * batch_size\n", "            max_length = text_input_ids.shape[-1]\n", "            uncond_input = self.tokenizer(\n", "                uncond_tokens,\n", "                padding=\"max_length\",\n", "                max_length=max_length,\n", "                truncation=True,\n", "                return_tensors=\"np\",\n", "            )\n", "\n", "            uncond_embeddings = self.text_encoder(uncond_input.input_ids)[self.text_encoder_out]\n", "\n", "            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method\n", "            seq_len = uncond_embeddings.shape[1]\n", "            uncond_embeddings = np.tile(uncond_embeddings, (1, num_images_per_prompt, 1))\n", "            uncond_embeddings = np.reshape(uncond_embeddings, (batch_size * num_images_per_prompt, seq_len, -1))\n", "\n", "            # For classifier free guidance, you need to do two forward passes.\n", "            # Here, you concatenate the unconditional and text embeddings into a single batch\n", "            # to avoid doing two forward passes\n", "            text_embeddings = np.concatenate([text_embeddings, uncond_embeddings, uncond_embeddings])\n", "\n", "        return text_embeddings\n", "\n", "    def prepare_image_latents(\n", "        self,\n", "        image,\n", "        batch_size=1,\n", "        num_images_per_prompt=1,\n", "        do_classifier_free_guidance=True,\n", "    ):\n", "        \"\"\"\n", "        Encodes input image to latent space using VAE Encoder\n", "\n", "        Parameters:\n", "            image (np.ndarray): input image tensor\n", "            num_images_per_prompt (int, *optional*, 1): number of images generated per prompt\n", "            do_classifier_free_guidance (bool): whether to use classifier free guidance or not\n", "        Returns:\n", "            image_latents: image encoded to latent space\n", "        \"\"\"\n", "\n", "        image = image.astype(np.float32)\n", "\n", "        batch_size = batch_size * num_images_per_prompt\n", "        image_latents = 
self.vae_encoder(image)[self.vae_encoder_out]\n", "\n", " if batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] == 0:\n", " # expand image_latents for batch_size\n", " additional_image_per_prompt = batch_size // image_latents.shape[0]\n", " image_latents = np.concatenate([image_latents] * additional_image_per_prompt, axis=0)\n", " elif batch_size > image_latents.shape[0] and batch_size % image_latents.shape[0] != 0:\n", " raise ValueError(f\"Cannot duplicate `image` of batch size {image_latents.shape[0]} to {batch_size} text prompts.\")\n", " else:\n", " image_latents = np.concatenate([image_latents], axis=0)\n", "\n", " if do_classifier_free_guidance:\n", " uncond_image_latents = np.zeros_like(image_latents)\n", " image_latents = np.concatenate([image_latents, image_latents, uncond_image_latents], axis=0)\n", "\n", " return image_latents\n", "\n", " def prepare_latents(\n", " self,\n", " batch_size: int,\n", " num_channels_latents: int,\n", " height: int,\n", " width: int,\n", " dtype: np.dtype = np.float32,\n", " latents: np.ndarray = None,\n", " ):\n", " \"\"\"\n", " Preparing noise to image generation. If initial latents are not provided, they will be generated randomly,\n", " then prepared latents scaled by the standard deviation required by the scheduler\n", "\n", " Parameters:\n", " batch_size (int): input batch size\n", " num_channels_latents (int): number of channels for noise generation\n", " height (int): image height\n", " width (int): image width\n", " dtype (np.dtype, *optional*, np.float32): dtype for latents generation\n", " latents (np.ndarray, *optional*, None): initial latent noise tensor, if not provided will be generated\n", " Returns:\n", " latents (np.ndarray): scaled initial noise for diffusion\n", " \"\"\"\n", " shape = (\n", " batch_size,\n", " num_channels_latents,\n", " height // self.vae_scale_factor,\n", " width // self.vae_scale_factor,\n", " )\n", " if latents is None:\n", " latents = randn_tensor(shape, dtype=dtype)\n", " else:\n", " latents = latents\n", "\n", " # scale the initial noise by the standard deviation required by the scheduler\n", " latents = latents * self.scheduler.init_noise_sigma.numpy()\n", " return latents\n", "\n", " def decode_latents(self, latents: np.array, pad: Tuple[int]):\n", " \"\"\"\n", " Decode predicted image from latent space using VAE Decoder and unpad image result\n", "\n", " Parameters:\n", " latents (np.ndarray): image encoded in diffusion latent space\n", " pad (Tuple[int]): each side padding sizes obtained on preprocessing step\n", " Returns:\n", " image: decoded by VAE decoder image\n", " \"\"\"\n", " latents = 1 / 0.18215 * latents\n", " image = self.vae_decoder(latents)[self.vae_decoder_out]\n", " (_, end_h), (_, end_w) = pad[1:3]\n", " h, w = image.shape[2:]\n", " unpad_h = h - end_h\n", " unpad_w = w - end_w\n", " image = image[:, :, :unpad_h, :unpad_w]\n", " image = np.clip(image / 2 + 0.5, 0, 1)\n", " image = np.transpose(image, (0, 2, 3, 1))\n", " return image" ] }, { "cell_type": "code", "execution_count": 7, "id": "4adbd225-5920-43f0-b856-81c9f4c56e8d", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "\n", "def visualize_results(\n", " orig_img: PIL.Image.Image,\n", " processed_img: PIL.Image.Image,\n", " img1_title: str,\n", " img2_title: str,\n", "):\n", " \"\"\"\n", " Helper function for results visualization\n", "\n", " Parameters:\n", " orig_img (PIL.Image.Image): original image\n", " processed_img (PIL.Image.Image): processed image after 
editing\n", " img1_title (str): title for the image on the left\n", " img2_title (str): title for the image on the right\n", " Returns:\n", " fig (matplotlib.pyplot.Figure): matplotlib generated figure contains drawing result\n", " \"\"\"\n", " im_w, im_h = orig_img.size\n", " is_horizontal = im_h <= im_w\n", " figsize = (20, 30) if is_horizontal else (30, 20)\n", " fig, axs = plt.subplots(\n", " 1 if is_horizontal else 2,\n", " 2 if is_horizontal else 1,\n", " figsize=figsize,\n", " sharex=\"all\",\n", " sharey=\"all\",\n", " )\n", " fig.patch.set_facecolor(\"white\")\n", " list_axes = list(axs.flat)\n", " for a in list_axes:\n", " a.set_xticklabels([])\n", " a.set_yticklabels([])\n", " a.get_xaxis().set_visible(False)\n", " a.get_yaxis().set_visible(False)\n", " a.grid(False)\n", " list_axes[0].imshow(np.array(orig_img))\n", " list_axes[1].imshow(np.array(processed_img))\n", " list_axes[0].set_title(img1_title, fontsize=20)\n", " list_axes[1].set_title(img2_title, fontsize=20)\n", " fig.subplots_adjust(wspace=0.0 if is_horizontal else 0.01, hspace=0.01 if is_horizontal else 0.0)\n", " fig.tight_layout()\n", " fig.savefig(\"result.png\", bbox_inches=\"tight\")\n", " return fig" ] }, { "cell_type": "markdown", "id": "8b50e5f5-0fe1-4d7a-9322-5226260287f2", "metadata": {}, "source": [ "Model tokenizer and scheduler are also important parts of the pipeline. Let us define them and put all components together.\n", "Additionally, you can provide device selecting one from available in dropdown list." ] }, { "cell_type": "code", "execution_count": 8, "id": "91ae1769-10fd-437d-b1c4-73edfd2d584c", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ff26c8cc3c624dffad2314fbbdfb434a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ipywidgets as widgets\n", "\n", "device = widgets.Dropdown(\n", " options=core.available_devices + [\"AUTO\"],\n", " value=\"AUTO\",\n", " description=\"Device:\",\n", " disabled=False,\n", ")\n", "\n", "device" ] }, { "cell_type": "code", "execution_count": 9, "id": "c04de88d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ltalamanova/env_ci/lib/python3.8/site-packages/diffusers/configuration_utils.py:134: FutureWarning: Accessing config attribute `unet` directly via 'OVInstructPix2PixPipeline' object attribute is deprecated. Please access 'unet' over 'OVInstructPix2PixPipeline's config object instead, e.g. 'scheduler.config.unet'.\n", " deprecate(\"direct config name access\", \"1.0.0\", deprecation_message, standard_warn=False)\n" ] } ], "source": [ "from transformers import CLIPTokenizer\n", "\n", "tokenizer = CLIPTokenizer.from_pretrained(\"openai/clip-vit-large-patch14\")\n", "scheduler = EulerAncestralDiscreteScheduler.from_config(scheduler_config)\n", "\n", "ov_pipe = OVInstructPix2PixPipeline(\n", " tokenizer,\n", " scheduler,\n", " core,\n", " TEXT_ENCODER_OV_PATH,\n", " VAE_ENCODER_OV_PATH,\n", " UNET_OV_PATH,\n", " VAE_DECODER_OV_PATH,\n", " device=device.value,\n", ")" ] }, { "cell_type": "markdown", "id": "4f83a572-16fe-48ef-b468-7a9668ac8d94", "metadata": {}, "source": [ "Now, you are ready to define editing instructions and an image for running the inference pipeline. 
You can find example results generated by the model on this [page](https://www.timothybrooks.com/instruct-pix2pix/), in case you need inspiration.\n", "Optionally, you can also change the random generator seed for latent state initialization and the number of steps.\n", "\n", "> **Note**: Consider increasing `steps` to get more precise results. A suggested value is `100`, but it will take more time to process." ] }, { "cell_type": "code", "execution_count": 10, "id": "2912cfe8", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "690e2028edf44e7e867cf75da3ca9294", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(Text(value=' Make it in galaxy', description='your text'), IntSlider(value=42, description='see…" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "style = {\"description_width\": \"initial\"}\n", "text_prompt = widgets.Text(value=\" Make it in galaxy\", description=\"your text\")\n", "num_steps = widgets.IntSlider(min=1, max=100, value=10, description=\"steps:\")\n", "seed = widgets.IntSlider(min=0, max=1024, description=\"seed: \", value=42)\n", "image_widget = widgets.FileUpload(accept=\"\", multiple=False, description=\"Upload image\", style=style)\n", "widgets.VBox([text_prompt, seed, num_steps, image_widget])" ] }, { "cell_type": "markdown", "id": "a41f30c2-ac22-4907-b16c-92610e4a0ab6", "metadata": {}, "source": [ "> **Note**: Diffusion process can take some time, depending on what hardware you select." ] }, { "cell_type": "code", "execution_count": 11, "id": "b33a9e7f-bee0-424a-8fc3-e0cdda0146f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pipeline settings\n", "Input text: Make it in galaxy\n", "Seed: 42\n", "Number of steps: 10\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ebc226dd2a74474195baed27fb59784b", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/10 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import requests\n", "from io import BytesIO\n", "\n", "print(\"Pipeline settings\")\n", "print(f\"Input text: {text_prompt.value}\")\n", "print(f\"Seed: {seed.value}\")\n", "print(f\"Number of steps: {num_steps.value}\")\n", "np.random.seed(seed.value)\n", "\n", "default_url = \"https://user-images.githubusercontent.com/29454499/223343459-4ac944f0-502e-4acf-9813-8e9f0abc8a16.jpg\"\n", "# use the uploaded image if one was provided, otherwise fall back to the example image\n", "if image_widget.value:\n", "    image = PIL.Image.open(BytesIO(image_widget.value[-1][\"content\"]))\n", "else:\n", "    image = PIL.Image.open(BytesIO(requests.get(default_url).content))\n", "image = image.convert(\"RGB\")\n", "\n", "processed_image = ov_pipe(text_prompt.value, image, num_steps.value)\n", "\n", "fig = visualize_results(\n", "    image,\n", "    processed_image[0],\n", "    img1_title=\"Original image\",\n", "    img2_title=f\"Prompt: {text_prompt.value}\",\n", ")" ] }, { "cell_type": "markdown", "id": "e90d66b9-ebc0-4020-ba6e-4076b0bde746", "metadata": {}, "source": [ "Nice. As you can see, the picture has quite a high definition 🔥." ] }, { "cell_type": "markdown", "id": "867aa235", "metadata": {}, "source": [ "## Quantization\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. Quantized operations are executed in `INT8` instead of `FP32`/`FP16`, making model inference faster.\n", "\n", "According to the `InstructPix2Pix` pipeline structure, the UNet is used for iterative denoising of the input. This means that the model runs inference in a loop, once per diffusion step, while the other parts of the pipeline run only once. That is why the computation cost and speed of UNet denoising become the critical path in the pipeline.\n", "\n", "The optimization process contains the following steps:\n", "\n", "1. Create a calibration dataset for quantization.\n", "2. Run `nncf.quantize()` to obtain the quantized model.\n", "3. 
Save the `INT8` model using `openvino.save_model()` function.\n", "\n", "Please select below whether you would like to run quantization to improve model inference speed." ] }, { "cell_type": "code", "execution_count": 13, "id": "1b5eeb3c", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9fe221a0f7624fc1b99240a7d8e89eac", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Checkbox(value=True, description='Quantization')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "to_quantize = widgets.Checkbox(\n", " value=True,\n", " description=\"Quantization\",\n", " disabled=False,\n", ")\n", "\n", "to_quantize" ] }, { "cell_type": "markdown", "id": "960382a8", "metadata": {}, "source": [ "Let's load `skip magic` extension to skip quantization if `to_quantize` is not selected" ] }, { "cell_type": "code", "execution_count": 14, "id": "8c737457", "metadata": {}, "outputs": [], "source": [ "# Fetch `skip_kernel_extension` module\n", "import requests\n", "\n", "r = requests.get(\n", " url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py\",\n", ")\n", "open(\"skip_kernel_extension.py\", \"w\").write(r.text)\n", "\n", "%load_ext skip_kernel_extension" ] }, { "cell_type": "markdown", "id": "9a30b1ef", "metadata": {}, "source": [ "### Prepare calibration dataset\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "We use a portion of [`fusing/instructpix2pix-1000-samples`](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) dataset from Hugging Face as calibration data.\n", "To collect intermediate model inputs for calibration we should customize `CompiledModel`." ] }, { "cell_type": "code", "execution_count": 15, "id": "ab6507d9", "metadata": {}, "outputs": [], "source": [ "%%skip not $to_quantize.value\n", "\n", "import datasets\n", "from tqdm.notebook import tqdm\n", "from transformers import Pipeline\n", "from typing import Any, Dict, List\n", "\n", "class CompiledModelDecorator(ov.CompiledModel):\n", " def __init__(self, compiled_model, prob: float, data_cache: List[Any] = None):\n", " super().__init__(compiled_model)\n", " self.data_cache = data_cache if data_cache else []\n", " self.prob = np.clip(prob, 0, 1)\n", "\n", " def __call__(self, *args, **kwargs):\n", " if np.random.rand() >= self.prob:\n", " self.data_cache.append(*args)\n", " return super().__call__(*args, **kwargs)\n", "\n", "def collect_calibration_data(pix2pix_pipeline: Pipeline, subset_size: int) -> List[Dict]:\n", " original_unet = pix2pix_pipeline.unet\n", " pix2pix_pipeline.unet = CompiledModelDecorator(original_unet, prob=0.3)\n", " dataset = datasets.load_dataset(\"fusing/instructpix2pix-1000-samples\", split=\"train\", streaming=True).shuffle(seed=42)\n", " pix2pix_pipeline.set_progress_bar_config(disable=True)\n", "\n", " # Run inference for data collection\n", " pbar = tqdm(total=subset_size)\n", " diff = 0\n", " for batch in dataset:\n", " prompt = batch[\"edit_prompt\"]\n", " image = batch[\"input_image\"].convert(\"RGB\")\n", " _ = pix2pix_pipeline(prompt, image)\n", " collected_subset_size = len(pix2pix_pipeline.unet.data_cache)\n", " if collected_subset_size >= subset_size:\n", " pbar.update(subset_size - pbar.n)\n", " break\n", " pbar.update(collected_subset_size - diff)\n", " diff = collected_subset_size\n", "\n", " calibration_dataset = pix2pix_pipeline.unet.data_cache\n", " pix2pix_pipeline.set_progress_bar_config(disable=False)\n", " 
pix2pix_pipeline.unet = original_unet\n", " return calibration_dataset" ] }, { "cell_type": "code", "execution_count": 17, "id": "5b4b6944", "metadata": { "test_replace": {"subset_size = 300": "subset_size = 10"} }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ltalamanova/env_ci/lib/python3.8/site-packages/diffusers/configuration_utils.py:134: FutureWarning: Accessing config attribute `unet` directly via 'OVInstructPix2PixPipeline' object attribute is deprecated. Please access 'unet' over 'OVInstructPix2PixPipeline's config object instead, e.g. 'scheduler.config.unet'.\n", " deprecate(\"direct config name access\", \"1.0.0\", deprecation_message, standard_warn=False)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "60d8f321124a45b18a0970b76ad0b189", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/300 [00:00 **NOTE**: Quantization is time and memory consuming operation. Running quantization code below may take some time." ] }, { "cell_type": "code", "execution_count": 18, "id": "82fd3ba9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Statistics collection: 100%|██████████| 300/300 [06:48<00:00, 1.36s/it]\n", "Applying Smooth Quant: 100%|██████████| 100/100 [00:07<00:00, 13.51it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:96 ignored nodes was found by name in the NNCFGraph\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Statistics collection: 100%|██████████| 300/300 [14:34<00:00, 2.91s/it]\n", "Applying Fast Bias correction: 100%|██████████| 186/186 [05:31<00:00, 1.78s/it]\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "import nncf\n", "\n", "if UNET_INT8_OV_PATH.exists():\n", " print(\"Loading quantized model\")\n", " quantized_unet = core.read_model(UNET_INT8_OV_PATH)\n", "else:\n", " unet = core.read_model(UNET_OV_PATH)\n", " quantized_unet = nncf.quantize(\n", " model=unet,\n", " subset_size=subset_size,\n", " calibration_dataset=nncf.Dataset(unet_calibration_data),\n", " model_type=nncf.ModelType.TRANSFORMER\n", " )\n", " ov.save_model(quantized_unet, UNET_INT8_OV_PATH)" ] }, { "cell_type": "markdown", "id": "c851ce55", "metadata": {}, "source": [ "Let us check predictions with the quantized UNet using the same input data." 
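, "\n", "\n", "You can also get a rough idea of the compression effect by comparing the size of the saved UNet weights on disk. This is a small optional sketch; it assumes that `unet.bin` and `unet_int8.bin` were written next to the corresponding `.xml` files by `ov.save_model` in the cells above:\n", "\n", "```python\n", "from pathlib import Path\n", "\n", "fp16_size = Path(\"unet.bin\").stat().st_size / 2**20\n", "int8_size = Path(\"unet_int8.bin\").stat().st_size / 2**20\n", "print(f\"FP16 UNet weights: {fp16_size:.1f} MB\")\n", "print(f\"INT8 UNet weights: {int8_size:.1f} MB\")\n", "print(f\"Compression ratio: {fp16_size / int8_size:.2f}x\")\n", "```"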
] }, { "cell_type": "code", "execution_count": 19, "id": "027e65aa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pipeline settings\n", "Input text: Make it in galaxy\n", "Seed: 42\n", "Number of steps: 10\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c593cddb53e14752999a19bf363bc409", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/10 [00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "print('Pipeline settings')\n", "print(f'Input text: {text_prompt.value}')\n", "print(f'Seed: {seed.value}')\n", "print(f'Number of steps: {num_steps.value}')\n", "np.random.seed(seed.value)\n", "\n", "int8_pipe = OVInstructPix2PixPipeline(tokenizer, scheduler, core, TEXT_ENCODER_OV_PATH, VAE_ENCODER_OV_PATH, UNET_INT8_OV_PATH, VAE_DECODER_OV_PATH, device=device.value)\n", "int8_processed_image = int8_pipe(text_prompt.value, image, num_steps.value)\n", "\n", "fig = visualize_results(processed_image[0], int8_processed_image[0], img1_title=\"FP16 result\", img2_title=\"INT8 result\")" ] }, { "cell_type": "markdown", "id": "5eb64dee", "metadata": {}, "source": [ "### Compare inference time of the FP16 and INT8 models\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "To measure the inference performance of the `FP16` and `INT8` models, we use median inference time on calibration subset.\n", "\n", "> **NOTE**: For the most accurate performance estimation, it is recommended to run `benchmark_app` in a terminal/command prompt after closing other applications." ] }, { "cell_type": "code", "execution_count": 20, "id": "50c073be", "metadata": {}, "outputs": [], "source": [ "%%skip not $to_quantize.value\n", "\n", "import time\n", "\n", "calibration_dataset = datasets.load_dataset(\"fusing/instructpix2pix-1000-samples\", split=\"train\", streaming=True)\n", "validation_data = []\n", "validation_size = 10\n", "while len(validation_data) < validation_size:\n", " batch = next(iter(calibration_dataset))\n", " prompt = batch[\"edit_prompt\"]\n", " input_image = batch[\"input_image\"].convert(\"RGB\")\n", " validation_data.append((prompt, input_image))\n", "\n", "def calculate_inference_time(pix2pix_pipeline, calibration_dataset, size=10):\n", " inference_time = []\n", " pix2pix_pipeline.set_progress_bar_config(disable=True)\n", " for (prompt, image) in calibration_dataset:\n", " start = time.perf_counter()\n", " _ = pix2pix_pipeline(prompt, image)\n", " end = time.perf_counter()\n", " delta = end - start\n", " inference_time.append(delta)\n", " return np.median(inference_time)" ] }, { "cell_type": "code", "execution_count": 21, "id": "2c0bbdb3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Performance speed up: 1.437\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "fp_latency = calculate_inference_time(ov_pipe, validation_data)\n", "int8_latency = calculate_inference_time(int8_pipe, validation_data)\n", "print(f\"Performance speed up: {fp_latency / int8_latency:.3f}\")" ] }, { "cell_type": "markdown", "id": "3d6f7968-e8b0-4066-a481-64dcb01723f2", "metadata": {}, "source": [ "## Interactive demo with Gradio\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "markdown", "id": "b18e04e5", "metadata": {}, "source": [ "> **Note**: Diffusion process can take some time, depending on what hardware you select." 
] }, { "cell_type": "code", "execution_count": 22, "id": "2c532e0f", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1327946dc52f401e95951fc3fa8ddfdd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Precision:', options=('FP16', 'INT8'), value='FP16')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe_precision = widgets.Dropdown(\n", " options=[\"FP16\"] if not to_quantize.value else [\"FP16\", \"INT8\"],\n", " value=\"FP16\",\n", " description=\"Precision:\",\n", " disabled=False,\n", ")\n", "\n", "pipe_precision" ] }, { "cell_type": "code", "execution_count": null, "id": "27f67e4b-17f5-4fa0-92aa-19af956d1f58", "metadata": { "test_replace": { " demo.queue().launch(debug=True)": " demo.queue().launch()", " demo.queue().launch(share=True, debug=True)": " demo.queue().launch(share=True)" } }, "outputs": [], "source": [ "import gradio as gr\n", "from pathlib import Path\n", "import numpy as np\n", "\n", "default_url = \"https://user-images.githubusercontent.com/29454499/223343459-4ac944f0-502e-4acf-9813-8e9f0abc8a16.jpg\"\n", "path = Path(\"data/example.jpg\")\n", "path.parent.mkdir(parents=True, exist_ok=True)\n", "\n", "r = requests.get(default_url)\n", "\n", "with path.open(\"wb\") as f:\n", " f.write(r.content)\n", "\n", "pipeline = int8_pipe if pipe_precision.value == \"INT8\" else ov_pipe\n", "\n", "\n", "def generate(img, text, seed, num_steps, _=gr.Progress(track_tqdm=True)):\n", " if img is None:\n", " raise gr.Error(\"Please upload an image or choose one from the examples list\")\n", " np.random.seed(seed)\n", " result = pipeline(text, img, num_steps)[0]\n", " return result\n", "\n", "\n", "demo = gr.Interface(\n", " generate,\n", " [\n", " gr.Image(label=\"Image\", type=\"pil\"),\n", " gr.Textbox(label=\"Text\"),\n", " gr.Slider(0, 1024, label=\"Seed\", value=42),\n", " gr.Slider(\n", " 1,\n", " 100,\n", " label=\"Steps\",\n", " value=10,\n", " info=\"Consider increasing the value to get more precise results. A suggested value is 100, but it will take more time to process.\",\n", " ),\n", " ],\n", " gr.Image(label=\"Result\"),\n", " examples=[[path, \"Make it in galaxy\"]],\n", ")\n", "\n", "try:\n", " demo.queue().launch(debug=True)\n", "except Exception:\n", " demo.queue().launch(share=True, debug=True)\n", "# if you are launching remotely, specify server_name and server_port\n", "# demo.launch(server_name='your server name', server_port='server port in int')\n", "# Read more in the docs: https://gradio.app/docs/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "openvino_notebooks": { "imageUrl": "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/instruct-pix2pix-image-editing/instruct-pix2pix-image-editing.png?raw=true", "tags": { "categories": [ "Model Demos", "AI Trends" ], "libraries": [], "other": [ "Stable Diffusion" ], "tasks": [ "Image-to-Image" ] } }, "vscode": { "interpreter": { "hash": "cec18e25feb9469b5ff1085a8097bdcd86db6a4ac301d6aeff87d0f3e7ce4ca5" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }