---
license: apache-2.0
datasets:
  - lambada
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - text-generation-inference
  - causal-lm
  - int8
  - tensorrt
  - ENOT-AutoDL
---

# GPT2

This repository contains GPT2 ONNX models compatible with TensorRT:

- `gpt2-xl.onnx` - GPT2-XL ONNX model for FP32 or FP16 engines
- `gpt2-xl-i8.onnx` - GPT2-XL ONNX model for INT8+FP32 engines

The models were quantized with the ENOT-AutoDL framework. Code for building TensorRT engines, along with usage examples, is published on [GitHub](https://github.com/ENOT-AutoDL/ENOT-transformers).
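As a general sketch (not taken from this repository's instructions; file names and flags are placeholders, and exact options depend on your TensorRT version), engines can typically be built from ONNX files with `trtexec`:

```shell
# Build an FP16 engine from the full-precision ONNX model
trtexec --onnx=gpt2-xl.onnx --fp16 --saveEngine=gpt2-xl-fp16.engine

# Build an INT8+FP32 engine from the quantized (Q/DQ) ONNX model
trtexec --onnx=gpt2-xl-i8.onnx --int8 --saveEngine=gpt2-xl-i8.engine
```

For the exact build procedure used to obtain the metrics below, refer to the ENOT-AutoDL repository.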

## Metrics

### GPT2-XL

|             | TensorRT INT8+FP32 | torch FP16 |
|-------------|--------------------|------------|
| Lambada Acc | 72.11%             | 71.43%     |
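For context, Lambada accuracy scores a sample as correct when the model's predicted final word matches the target word. A minimal sketch of the metric itself, with a stub predictor (names are illustrative, not from the repository):

```python
def lambada_accuracy(samples, predict_last_word):
    """Fraction of samples where the predicted final word equals the target."""
    correct = sum(predict_last_word(context) == target for context, target in samples)
    return correct / len(samples)

# Toy usage with a hypothetical predictor standing in for the model
samples = [("the cat sat on the", "mat"), ("once upon a", "time")]
predict = lambda ctx: "mat" if "cat" in ctx else "time"
print(lambada_accuracy(samples, predict))  # → 1.0
```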

Test environment:

- GPU: RTX 4090
- CPU: 11th Gen Intel(R) Core(TM) i7-11700K
- TensorRT 8.5.3.1
- pytorch 1.13.1+cu116

## Latency

### GPT2-XL

| Input sequence length | Number of generated tokens | TensorRT INT8+FP32, ms | torch FP16, ms | Acceleration |
|-----------------------|----------------------------|------------------------|----------------|--------------|
| 64                    | 64                         | 462                    | 1190           | 2.58         |
| 64                    | 128                        | 920                    | 2360           | 2.54         |
| 64                    | 256                        | 1890                   | 4710           | 2.54         |
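The Acceleration column is the ratio of torch FP16 latency to TensorRT INT8+FP32 latency. Recomputing it from the rounded millisecond values above approximately reproduces the table (small discrepancies come from rounding of the reported latencies):

```python
# (seq_len, generated_tokens, trt_int8_ms, torch_fp16_ms) from the table above
rows = [(64, 64, 462, 1190), (64, 128, 920, 2360), (64, 256, 1890, 4710)]
for seq_len, gen_tokens, trt_ms, torch_ms in rows:
    print(f"seq={seq_len} gen={gen_tokens} speedup={torch_ms / trt_ms:.2f}")
```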


## How to use

An inference example and an accuracy test are published on GitHub:

```shell
git clone https://github.com/ENOT-AutoDL/ENOT-transformers
```