{ "cells": [ { "cell_type": "markdown", "id": "28e3460d-59b1-4d4c-b62e-510987fb2f28", "metadata": {}, "source": [ "# Introduction\n", "This notebook is to show how to launch the TGI Benchmark tool. \n" ] }, { "cell_type": "markdown", "id": "c0de3cc9-c6cd-45b3-9dd0-84b3cb2fc8b2", "metadata": {}, "source": [ "Here we can see the different settings for TGI Benchmark. \n", "\n", "Here are some of the more important ones:\n", "\n", "- `--tokenizer-name` This is required so the tool knows what tokenizer to use\n", "- `--batch-size` This is important for load testing. We should use more and more values to see what happens to throughput and latency\n", "- `--sequence-length` AKA input tokens, it is important to match your use-case needs\n", "- `--decode-length` AKA output tokens, it is important to match your use-case needs\n", "- `--runs` 10 is the default\n", "\n", "
\n", " 💡 Tip: Use a low number for --runs when you are exploring but a higher number as you finalize to get more precise statistics\n", "
\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "694df6d6-a521-4dab-977b-2828d4250781", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text Generation Benchmarking tool\n", "\n", "\u001b[1m\u001b[4mUsage:\u001b[0m \u001b[1mtext-generation-benchmark\u001b[0m [OPTIONS] \u001b[1m--tokenizer-name\u001b[0m \n", "\n", "\u001b[1m\u001b[4mOptions:\u001b[0m\n", " \u001b[1m-t\u001b[0m, \u001b[1m--tokenizer-name\u001b[0m \n", " The name of the tokenizer (as in model_id on the huggingface hub, or local path) [env: TOKENIZER_NAME=]\n", " \u001b[1m--revision\u001b[0m \n", " The revision to use for the tokenizer if on the hub [env: REVISION=] [default: main]\n", " \u001b[1m-b\u001b[0m, \u001b[1m--batch-size\u001b[0m \n", " The various batch sizes to benchmark for, the idea is to get enough batching to start seeing increased latency, this usually means you're moving from memory bound (usual as BS=1) to compute bound, and this is a sweet spot for the maximum batch size for the model under test\n", " \u001b[1m-s\u001b[0m, \u001b[1m--sequence-length\u001b[0m \n", " This is the initial prompt sent to the text-generation-server length in token. Longer prompt will slow down the benchmark. Usually the latency grows somewhat linearly with this for the prefill step [env: SEQUENCE_LENGTH=] [default: 10]\n", " \u001b[1m-d\u001b[0m, \u001b[1m--decode-length\u001b[0m \n", " This is how many tokens will be generated by the server and averaged out to give the `decode` latency. This is the *critical* number you want to optimize for LLM spend most of their time doing decoding [env: DECODE_LENGTH=] [default: 8]\n", " \u001b[1m-r\u001b[0m, \u001b[1m--runs\u001b[0m \n", " How many runs should we average from [env: RUNS=] [default: 10]\n", " \u001b[1m-w\u001b[0m, \u001b[1m--warmups\u001b[0m \n", " Number of warmup cycles [env: WARMUPS=] [default: 1]\n", " \u001b[1m-m\u001b[0m, \u001b[1m--master-shard-uds-path\u001b[0m \n", " The location of the grpc socket. 
This benchmark tool bypasses the router completely and directly talks to the gRPC processes [env: MASTER_SHARD_UDS_PATH=] [default: /tmp/text-generation-server-0]\n", " \u001b[1m--temperature\u001b[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TEMPERATURE=]\n", " \u001b[1m--top-k\u001b[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_K=]\n", " \u001b[1m--top-p\u001b[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_P=]\n", " \u001b[1m--typical-p\u001b[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TYPICAL_P=]\n", " \u001b[1m--repetition-penalty\u001b[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: REPETITION_PENALTY=]\n", " \u001b[1m--frequency-penalty\u001b[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: FREQUENCY_PENALTY=]\n", " \u001b[1m--watermark\u001b[0m\n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: WATERMARK=]\n", " \u001b[1m--do-sample\u001b[0m\n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: DO_SAMPLE=]\n", " \u001b[1m--top-n-tokens\u001b[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_N_TOKENS=]\n", " \u001b[1m-h\u001b[0m, \u001b[1m--help\u001b[0m\n", " Print help (see more with '--help')\n", " \u001b[1m-V\u001b[0m, \u001b[1m--version\u001b[0m\n", " Print version\n" ] } ], "source": [ "!text-generation-benchmark -h" ] }, { "cell_type": "markdown", "id": "42d9561b-1aea-4c8c-9fe8-e36af43482fe", "metadata": {}, "source": [ "Here is an example command. Notice that I add the batch sizes of interest repeatedly to make sure all of them are used by the benchmark tool.\n", "\n", "
\n", " ⚠️ Warning: Please note that the TGI Benchmark tool is designed to work in a terminal, not a jupyter notebook. This means you will need to copy/paste the command in a jupyter terminal tab. I am putting them here for convenience.\n", "
\n", "\n", "```bash\n", "text-generation-benchmark \\\n", "--tokenizer-name astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \\\n", "--sequence-length 70 \\\n", "--decode-length 50 \\\n", "--batch-size 1 \\\n", "--batch-size 2 \\\n", "--batch-size 4 \\\n", "--batch-size 8 \\\n", "--batch-size 16 \\\n", "--batch-size 32 \\\n", "--batch-size 64 \\\n", "--batch-size 128 \n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "13ac475b-44e1-47e4-85ce-def2db6879c9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }