{ "cells": [ { "cell_type": "markdown", "id": "28e3460d-59b1-4d4c-b62e-510987fb2f28", "metadata": {}, "source": [ "# Introduction\n", "This notebook shows how to launch the TGI Benchmark tool.\n" ] }, { "cell_type": "markdown", "id": "c0de3cc9-c6cd-45b3-9dd0-84b3cb2fc8b2", "metadata": {}, "source": [ "The help output in the next cell lists the available settings for TGI Benchmark.\n", "\n", "Here are some of the more important ones:\n", "\n", "- `--tokenizer-name` Required. Tells the tool which tokenizer to use (a model ID on the Hugging Face Hub or a local path)\n", "- `--batch-size` Important for load testing. Pass progressively larger values to see how throughput and latency change\n", "- `--sequence-length` The number of input tokens. Match it to your use case\n", "- `--decode-length` The number of output tokens. Match it to your use case\n", "- `--runs` The number of runs to average over. The default is 10\n", "\n", "
\n", " 💡 Tip: Use a low value for --runs while you are exploring, and a higher value as you finalize to get more precise statistics.\n", "
\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "694df6d6-a521-4dab-977b-2828d4250781", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text Generation Benchmarking tool\n", "\n", "\u001B[1m\u001B[4mUsage:\u001B[0m \u001B[1mtext-generation-benchmark\u001B[0m [OPTIONS] \u001B[1m--tokenizer-name\u001B[0m \n", "\n", "\u001B[1m\u001B[4mOptions:\u001B[0m\n", " \u001B[1m-t\u001B[0m, \u001B[1m--tokenizer-name\u001B[0m \n", " The name of the tokenizer (as in model_id on the huggingface hub, or local path) [env: TOKENIZER_NAME=]\n", " \u001B[1m--revision\u001B[0m \n", " The revision to use for the tokenizer if on the hub [env: REVISION=] [default: main]\n", " \u001B[1m-b\u001B[0m, \u001B[1m--batch-size\u001B[0m \n", " The various batch sizes to benchmark for, the idea is to get enough batching to start seeing increased latency, this usually means you're moving from memory bound (usual as BS=1) to compute bound, and this is a sweet spot for the maximum batch size for the model under test\n", " \u001B[1m-s\u001B[0m, \u001B[1m--sequence-length\u001B[0m \n", " This is the initial prompt sent to the text-generation-server length in token. Longer prompt will slow down the benchmark. Usually the latency grows somewhat linearly with this for the prefill step [env: SEQUENCE_LENGTH=] [default: 10]\n", " \u001B[1m-d\u001B[0m, \u001B[1m--decode-length\u001B[0m \n", " This is how many tokens will be generated by the server and averaged out to give the `decode` latency. 
This is the *critical* number you want to optimize for LLM spend most of their time doing decoding [env: DECODE_LENGTH=] [default: 8]\n", " \u001B[1m-r\u001B[0m, \u001B[1m--runs\u001B[0m \n", " How many runs should we average from [env: RUNS=] [default: 10]\n", " \u001B[1m-w\u001B[0m, \u001B[1m--warmups\u001B[0m \n", " Number of warmup cycles [env: WARMUPS=] [default: 1]\n", " \u001B[1m-m\u001B[0m, \u001B[1m--master-shard-uds-path\u001B[0m \n", " The location of the grpc socket. This benchmark tool bypasses the router completely and directly talks to the gRPC processes [env: MASTER_SHARD_UDS_PATH=] [default: /tmp/text-generation-server-0]\n", " \u001B[1m--temperature\u001B[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TEMPERATURE=]\n", " \u001B[1m--top-k\u001B[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_K=]\n", " \u001B[1m--top-p\u001B[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_P=]\n", " \u001B[1m--typical-p\u001B[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TYPICAL_P=]\n", " \u001B[1m--repetition-penalty\u001B[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: REPETITION_PENALTY=]\n", " \u001B[1m--frequency-penalty\u001B[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: FREQUENCY_PENALTY=]\n", " \u001B[1m--watermark\u001B[0m\n", " Generation parameter in case you 
want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: WATERMARK=]\n", " \u001B[1m--do-sample\u001B[0m\n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: DO_SAMPLE=]\n", " \u001B[1m--top-n-tokens\u001B[0m \n", " Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_N_TOKENS=]\n", " \u001B[1m-h\u001B[0m, \u001B[1m--help\u001B[0m\n", " Print help (see more with '--help')\n", " \u001B[1m-V\u001B[0m, \u001B[1m--version\u001B[0m\n", " Print version\n" ] } ], "source": [ "!text-generation-benchmark -h" ] }, { "cell_type": "markdown", "id": "42d9561b-1aea-4c8c-9fe8-e36af43482fe", "metadata": {}, "source": [ "Here is an example command. Notice that I repeat the `--batch-size` flag once for each batch size of interest, to make sure all of them are used by the benchmark tool. I chose these batch sizes based on estimated user activity.\n", "\n", "
\n", " ⚠️ Warning: The TGI Benchmark tool is designed to run in a terminal, not in a Jupyter notebook. This means you will need to copy and paste the command into a Jupyter terminal tab. I am including it here for convenience.\n", "
\n", "\n", "```bash\n", "text-generation-benchmark \\\n", "--tokenizer-name astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \\\n", "--sequence-length 70 \\\n", "--decode-length 50 \\\n", "--batch-size 1 \\\n", "--batch-size 2 \\\n", "--batch-size 4 \\\n", "--batch-size 8 \\\n", "--batch-size 16 \\\n", "--batch-size 32 \\\n", "--batch-size 64 \\\n", "--batch-size 128 \n", "```\n", "\n", "Hit `q` to stop the tool." ] }, { "cell_type": "code", "execution_count": null, "id": "13ac475b-44e1-47e4-85ce-def2db6879c9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }