{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "cacbe6b4", "metadata": { "id": "rQc-wXjqrEuR" }, "source": [ "# Quantize NLP models with Post-Training Quantization ​in NNCF\n", "This tutorial demonstrates how to apply `INT8` quantization to the Natural Language Processing model known as [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), using the [Post-Training Quantization API](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/quantizing-models-post-training/basic-quantization-flow.html) (NNCF library). A fine-tuned [HuggingFace BERT](https://huggingface.co/transformers/model_doc/bert.html) [PyTorch](https://pytorch.org/) model, trained on the [Microsoft Research Paraphrase Corpus (MRPC)](https://www.microsoft.com/en-us/download/details.aspx?id=52398), will be used. The tutorial is designed to be extendable to custom models and datasets. It consists of the following steps:\n", "\n", "- Download and prepare the BERT model and MRPC dataset.\n", "- Define data loading and accuracy validation functionality.\n", "- Prepare the model for quantization.\n", "- Run optimization pipeline.\n", "- Load and test quantized model.\n", "- Compare the performance of the original, converted and quantized models.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d89f8a6b", "metadata": {}, "source": [ "\n", "#### Table of contents:\n", "\n", "- [Imports](#Imports)\n", "- [Settings](#Settings)\n", "- [Prepare the Model](#Prepare-the-Model)\n", "- [Prepare the Dataset](#Prepare-the-Dataset)\n", "- [Optimize model using NNCF Post-training Quantization API](#Optimize-model-using-NNCF-Post-training-Quantization-API)\n", "- [Load and Test OpenVINO Model](#Load-and-Test-OpenVINO-Model)\n", " - [Select inference device](#Select-inference-device)\n", "- [Compare F1-score of FP32 and INT8 models](#Compare-F1-score-of-FP32-and-INT8-models)\n", "- [Compare Performance of the Original, Converted and Quantized Models](#Compare-Performance-of-the-Original,-Converted-and-Quantized-Models)\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "694d9fc1-501c-4b86-a747-637e2aad64ba", "metadata": {}, "outputs": [], "source": [ "%pip install -q \"nncf>=2.5.0\"\n", "%pip install -q torch transformers \"torch>=2.1\" datasets evaluate tqdm --extra-index-url https://download.pytorch.org/whl/cpu\n", "%pip install -q \"openvino>=2023.1.0\"" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4d6b41e6-132b-40da-b3b9-91bacba29e31", "metadata": {}, "source": [ "## Imports\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "771388d6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-07-10 09:01:29.708173: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. 
To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", "2023-07-10 09:01:29.872021: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2023-07-10 09:01:30.707194: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino\n" ] } ], "source": [ "import os\n", "import time\n", "from pathlib import Path\n", "from zipfile import ZipFile\n", "from typing import Iterable\n", "from typing import Any\n", "\n", "import datasets\n", "import evaluate\n", "import numpy as np\n", "import nncf\n", "from nncf.parameters import ModelType\n", "import openvino as ov\n", "import torch\n", "from transformers import BertForSequenceClassification, BertTokenizer\n", "\n", "# Fetch the `notebook_utils` module\n", "import requests\n", "\n", "r = requests.get(\n", "    url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py\",\n", ")\n", "\n", "with open(\"notebook_utils.py\", \"w\") as f:\n", "    f.write(r.text)\n", "\n", "from notebook_utils import download_file" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e9e66896-d439-4065-868a-65b44d31525a", "metadata": {}, "source": [ "## Settings\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "284e9a4b", "metadata": {}, "outputs": [], "source": [ "# Set the data and model directories, the source URL and the filename of the model.\n", "DATA_DIR = \"data\"\n", "MODEL_DIR = \"model\"\n", "MODEL_LINK = \"https://download.pytorch.org/tutorial/MRPC.zip\"\n", "FILE_NAME = MODEL_LINK.split(\"/\")[-1]\n", "PRETRAINED_MODEL_DIR = os.path.join(MODEL_DIR, \"MRPC\")\n", "\n", "os.makedirs(DATA_DIR, exist_ok=True)\n", "os.makedirs(MODEL_DIR, exist_ok=True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "44dc335d", "metadata": { "id": "YytHDzLE0uOJ", "pycharm": { "name": "#%% md\n" } }, "source": [ "## Prepare the Model\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Perform the following:\n", "\n", "- Download and unpack the pre-trained BERT model for MRPC provided by PyTorch.\n", "- Convert the model to the OpenVINO Intermediate Representation (OpenVINO IR).\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "be9fc64c", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7436f7d07d434a4db799d27084446df9", "version_major": 2, "version_minor": 0 }, "text/plain": [ "model/MRPC.zip:   0%|          | 0.00/387M [00:00<?, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "download_file(MODEL_LINK, directory=MODEL_DIR, show_progress=True)\n", "\n", "with ZipFile(f\"{MODEL_DIR}/{FILE_NAME}\", \"r\") as zip_ref:\n", "    zip_ref.extractall(MODEL_DIR)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Compare F1-score of FP32 and INT8 models\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def validate(model: ov.Model, dataset: Iterable[Any]) -> float:\n", "    \"\"\"\n", "    Evaluate the model on GLUE dataset.\n", "    Returns F1 score metric.\n", "    \"\"\"\n", "    compiled_model = core.compile_model(model, device_name=device.value)\n", "    output_layer = compiled_model.output(0)\n", "\n", "    metric = evaluate.load(\"glue\", \"mrpc\")\n", "    for batch in dataset:\n", "        inputs = [np.expand_dims(np.asarray(batch[key], dtype=np.int64), 0) for key in INPUT_NAMES]\n", "        outputs = compiled_model(inputs)[output_layer]\n", "        predictions = outputs[0].argmax(axis=-1)\n", "        metric.add_batch(predictions=[predictions], references=[batch[\"labels\"]])\n", "    metrics = metric.compute()\n", "    f1_score = metrics[\"f1\"]\n", "\n", "    return 
f1_score\n", "\n", "\n", "print(\"Checking the accuracy of the original model:\")\n", "metric = validate(model, data_source)\n", "print(f\"F1 score: {metric:.4f}\")\n", "\n", "print(\"Checking the accuracy of the quantized model:\")\n", "metric = validate(quantized_model, data_source)\n", "print(f\"F1 score: {metric:.4f}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4f67f6a2", "metadata": { "id": "vQACMfAUo52V", "tags": [] }, "source": [ "## Compare Performance of the Original, Converted and Quantized Models\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Compare the original PyTorch model with the converted and quantized OpenVINO models (`FP32`, `INT8`) to see the difference in performance. Performance is expressed in sentences per second (SPS), the text equivalent of frames per second (FPS) for images." ] }, { "cell_type": "code", "execution_count": 13, "id": "734ae69a", "metadata": {}, "outputs": [], "source": [ "# Compile the model for the selected device.\n", "compiled_model = core.compile_model(model=model, device_name=device.value)" ] }, { "cell_type": "code", "execution_count": 14, "id": "f484fff2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PyTorch model on CPU: 0.080 seconds per sentence, SPS: 12.47\n", "IR FP32 model in OpenVINO Runtime/AUTO: 0.024 seconds per sentence, SPS: 41.92\n", "OpenVINO IR INT8 model in OpenVINO Runtime/AUTO: 0.012 seconds per sentence, SPS: 84.38\n" ] } ], "source": [ "num_samples = 50\n", "sample = data_source[0]\n", "inputs = {k: torch.unsqueeze(torch.tensor(sample[k]), 0) for k in [\"input_ids\", \"token_type_ids\", \"attention_mask\"]}\n", "\n", "with torch.no_grad():\n", "    start = time.perf_counter()\n", "    for _ in range(num_samples):\n", "        torch_model(**inputs)\n", "    end = time.perf_counter()\n", "    time_torch = end - start\n", "print(f\"PyTorch model on CPU: {time_torch / num_samples:.3f} seconds per sentence, \" f\"SPS: {num_samples / time_torch:.2f}\")\n", "\n", "start = time.perf_counter()\n", "for _ in range(num_samples):\n", "    compiled_model(inputs)\n", "end = time.perf_counter()\n", "time_ir = end - start\n", "print(f\"IR FP32 model in OpenVINO Runtime/{device.value}: {time_ir / num_samples:.3f} \" f\"seconds per sentence, SPS: {num_samples / time_ir:.2f}\")\n", "\n", "start = time.perf_counter()\n", "for _ in range(num_samples):\n", "    compiled_quantized_model(inputs)\n", "end = time.perf_counter()\n", "time_ir = end - start\n", "print(f\"OpenVINO IR INT8 model in OpenVINO Runtime/{device.value}: {time_ir / num_samples:.3f} \" f\"seconds per sentence, SPS: {num_samples / time_ir:.2f}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "add78af0", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Finally, measure the inference performance of the OpenVINO `FP32` and `INT8` models. For this purpose, use the [Benchmark Tool](https://docs.openvino.ai/2024/learn-openvino/openvino-samples/benchmark-tool.html) in OpenVINO.\n", "\n", "> **Note**: The `benchmark_app` tool can measure the performance of OpenVINO Intermediate Representation (OpenVINO IR) models only. For more accurate performance results, run `benchmark_app` in a terminal/command prompt after closing other applications. Run `benchmark_app -m model.xml -d CPU` to benchmark async inference on CPU for one minute. Change `CPU` to `GPU` to benchmark on GPU. Run `benchmark_app --help` to see an overview of all command-line options."
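, "\n", "Since the `-shape` values must line up with the model's inputs, it helps to print the input names and shapes of the converted IR first. The next snippet is a minimal sketch (it assumes the `core` object and the `ir_model_xml` path defined earlier in this notebook):\n", "\n", "```python\n", "# Print each input's name and partial shape of the FP32 IR so that the\n", "# -shape arguments passed to benchmark_app can be matched to the correct\n", "# inputs, e.g. input_ids[1,128],token_type_ids[1,128],attention_mask[1,128].\n", "for model_input in core.read_model(ir_model_xml).inputs:\n", "    print(model_input.any_name, model_input.partial_shape)\n", "```\n"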
] }, { "cell_type": "code", "execution_count": 17, "id": "f71b38a8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Step 1/11] Parsing and validating input arguments\n", "[ INFO ] Parsing input parameters\n", "[Step 2/11] Loading OpenVINO Runtime\n", "[ WARNING ] Default duration 120 seconds is used for unknown device device.value\n", "[ INFO ] OpenVINO:\n", "[ INFO ] Build ................................. 2023.0.0-10926-b4452d56304-releases/2023/0\n", "[ INFO ] \n", "[ INFO ] Device info:\n", "[ ERROR ] Check 'false' failed at src/inference/src/core.cpp:84:\n", "Device with \"device\" name is not registered in the OpenVINO Runtime\n", "Traceback (most recent call last):\n", " File \"/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/openvino/tools/benchmark/main.py\", line 103, in main\n", " benchmark.print_version_info()\n", " File \"/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/openvino/tools/benchmark/benchmark.py\", line 48, in print_version_info\n", " for device, version in self.core.get_versions(self.device).items():\n", "RuntimeError: Check 'false' failed at src/inference/src/core.cpp:84:\n", "Device with \"device\" name is not registered in the OpenVINO Runtime\n", "\n" ] } ], "source": [ "# Inference FP32 model (OpenVINO IR)\n", "!benchmark_app -m $ir_model_xml -shape [1,128],[1,128],[1,128] -d {device.value} -api sync" ] }, { "cell_type": "code", "execution_count": 16, "id": "fdf41525", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Step 1/11] Parsing and validating input arguments\n", "[ INFO ] Parsing input parameters\n", "[Step 2/11] Loading OpenVINO Runtime\n", "[ WARNING ] Default duration 120 seconds is used for unknown device device.value\n", "[ INFO ] OpenVINO:\n", "[ INFO ] Build ................................. 2023.0.0-10926-b4452d56304-releases/2023/0\n", "[ INFO ] \n", "[ INFO ] Device info:\n", "[ ERROR ] Check 'false' failed at src/inference/src/core.cpp:84:\n", "Device with \"device\" name is not registered in the OpenVINO Runtime\n", "Traceback (most recent call last):\n", " File \"/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/openvino/tools/benchmark/main.py\", line 103, in main\n", " benchmark.print_version_info()\n", " File \"/home/ea/work/notebooks_convert/notebooks_conv_env/lib/python3.8/site-packages/openvino/tools/benchmark/benchmark.py\", line 48, in print_version_info\n", " for device, version in self.core.get_versions(self.device).items():\n", "RuntimeError: Check 'false' failed at src/inference/src/core.cpp:84:\n", "Device with \"device\" name is not registered in the OpenVINO Runtime\n", "\n" ] } ], "source": [ "# Inference INT8 model (OpenVINO IR)\n", "! 
benchmark_app -m $compressed_model_xml -shape [1,128],[1,128],[1,128] -d {device.value} -api sync" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "openvino_notebooks": { "imageUrl": "", "tags": { "categories": [ "API Overview", "Optimize" ], "libraries": [], "other": [], "tasks": [ "Text Classification" ] } }, "vscode": { "interpreter": { "hash": "cec18e25feb9469b5ff1085a8097bdcd86db6a4ac301d6aeff87d0f3e7ce4ca5" } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }