Benchmark and Model Zoo

Mirror sites

Since MMDetection V2.0, the model zoo is hosted only on aliyun. The V1.x model zoo has been deprecated.

Common settings

All models were trained on coco_2017_train and tested on coco_2017_val.
We use distributed training.
All PyTorch-style backbones pretrained on ImageNet come from the PyTorch model zoo; Caffe-style pretrained backbones are converted from the models newly released by detectron2.
For fair comparison with other codebases, we report the GPU memory as the maximum value of torch.cuda.max_memory_allocated() across all 8 GPUs. Note that this value is usually less than what nvidia-smi shows.
We report the inference time as the total time of network forwarding and post-processing, excluding the data loading time. Results are obtained with the script benchmark.py which computes the average time on 2000 images.
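The GPU memory rule above can be sketched as follows. This is a minimal illustration, not MMDetection's actual logging code: the helper name `reported_memory_gb` and the sample byte counts are hypothetical, the conversion assumes 1 GB = 1024**3 bytes, and in a real run each list entry would be the value of torch.cuda.max_memory_allocated() read on one GPU.

```python
def reported_memory_gb(peak_bytes_per_gpu):
    """Return the largest per-GPU peak allocation, converted to GB.

    Assumption: 1 GB is taken as 1024**3 bytes.
    """
    if not peak_bytes_per_gpu:
        raise ValueError("need at least one per-GPU measurement")
    # The reported figure is the maximum over GPUs, not the mean or the sum.
    return max(peak_bytes_per_gpu) / 1024 ** 3


# Hypothetical peak allocations (in bytes) for 8 GPUs; in practice each value
# would come from torch.cuda.max_memory_allocated() on the respective device.
sample_peaks = [int(3.7 * 1024 ** 3)] * 7 + [4 * 1024 ** 3]
print(f"{reported_memory_gb(sample_peaks):.1f} GB")  # prints "4.0 GB"
```

Taking the maximum rather than the average means a single heavily loaded GPU determines the reported number.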

ImageNet Pretrained Models

It is common to initialize from backbone models pre-trained on the ImageNet classification task. All pre-trained model links can be found at open_mmlab. Based on img_norm_cfg and the source of the weights, the ImageNet pre-trained model weights fall into the following categories:

TorchVision: Weights from torchvision, including ResNet50 and ResNet101. The img_norm_cfg is dict(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True).
Pycls: Weights from pycls, including RegNetX. The img_norm_cfg is dict(mean=[103.530, 116.280, 123.675], std=[57.375, 57.12, 58.395], to_rgb=False).
MSRA styles: Weights from MSRA, including ResNet50_Caffe and ResNet101_Caffe. The img_norm_cfg is dict(mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False).
Caffe2 styles: Currently contains only ResNext101_32x8d. The img_norm_cfg is dict(mean=[103.530, 116.280, 123.675], std=[57.375, 57.120, 58.395], to_rgb=False).
Other styles: For example, SSD, which uses the img_norm_cfg dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True), and YOLOv3, which uses the img_norm_cfg dict(mean=[0, 0, 0], std=[255., 255., 255.], to_rgb=True).
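To make the role of the three img_norm_cfg fields concrete, here is a minimal sketch that normalizes a single pixel. It is an illustration under stated assumptions rather than the library's implementation: the function `normalize_pixel` is hypothetical, the input pixel is assumed to be in BGR order (as loaded by OpenCV), and the channel swap is assumed to happen before the mean and std are applied, so mean and std are given in the order the network sees.

```python
def normalize_pixel(bgr_pixel, mean, std, to_rgb):
    """Normalize one pixel according to an img_norm_cfg-style setting.

    Assumption: bgr_pixel is in BGR channel order.
    """
    # Optionally reorder BGR -> RGB first, then apply (value - mean) / std
    # channel by channel.
    channels = list(reversed(bgr_pixel)) if to_rgb else list(bgr_pixel)
    return [(c - m) / s for c, m, s in zip(channels, mean, std)]


torchvision_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
msra_cfg = dict(
    mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)

# A BGR pixel whose channels equal the dataset mean maps to zero under both
# styles, because each style lists its mean in its own channel order.
pixel = (103.53, 116.28, 123.675)
print(normalize_pixel(pixel, **torchvision_cfg))  # [0.0, 0.0, 0.0]
print(normalize_pixel(pixel, **msra_cfg))         # [0.0, 0.0, 0.0]
```

The example shows why the img_norm_cfg must match the source of the weights: the same numbers appear in reversed order between the TorchVision and Caffe-style settings, so mixing them would shift the wrong channels.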

The commonly used backbone models in MMDetection are listed in the table below:

model	source	link	description
ResNet50	TorchVision	torchvision's ResNet-50	From torchvision's ResNet-50.
ResNet101	TorchVision	torchvision's ResNet-101	From torchvision's ResNet-101.
RegNetX	Pycls	RegNetX_3.2gf, RegNetX_800mf, etc.	From pycls.
ResNet50_Caffe	MSRA	MSRA's ResNet-50	Converted copy of Detectron2's R-50.pkl model. The original weight comes from MSRA's original ResNet-50.
ResNet101_Caffe	MSRA	MSRA's ResNet-101	Converted copy of Detectron2's R-101.pkl model. The original weight comes from MSRA's original ResNet-101.
ResNext101_32x8d	Caffe2	Caffe2 ResNext101_32x8d	Converted copy of Detectron2's X-101-32x8d.pkl model. The ResNeXt-101-32x8d model trained with Caffe2 at FB.

Baselines

RPN

Please refer to RPN for details.

Faster R-CNN

Please refer to Faster R-CNN for details.

Mask R-CNN

Please refer to Mask R-CNN for details.

Fast R-CNN (with pre-computed proposals)

Please refer to Fast R-CNN for details.

RetinaNet

Please refer to RetinaNet for details.

Cascade R-CNN and Cascade Mask R-CNN

Please refer to Cascade R-CNN for details.

Hybrid Task Cascade (HTC)

Please refer to HTC for details.

SSD

Please refer to SSD for details.

Group Normalization (GN)

Please refer to Group Normalization for details.

Weight Standardization

Please refer to Weight Standardization for details.

Deformable Convolution v2

Please refer to Deformable Convolutional Networks for details.

CARAFE: Content-Aware ReAssembly of FEatures

Please refer to CARAFE for details.

InstaBoost

Please refer to InstaBoost for details.

Libra R-CNN

Please refer to Libra R-CNN for details.

Guided Anchoring

Please refer to Guided Anchoring for details.

FCOS

Please refer to FCOS for details.

FoveaBox

Please refer to FoveaBox for details.

RepPoints

Please refer to RepPoints for details.

FreeAnchor

Please refer to FreeAnchor for details.

Grid R-CNN (plus)

Please refer to Grid R-CNN for details.

GHM

Please refer to GHM for details.

GCNet

Please refer to GCNet for details.

HRNet

Please refer to HRNet for details.

Mask Scoring R-CNN

Please refer to Mask Scoring R-CNN for details.

Train from Scratch

Please refer to Rethinking ImageNet Pre-training for details.

NAS-FPN

Please refer to NAS-FPN for details.

ATSS

Please refer to ATSS for details.

FSAF

Please refer to FSAF for details.

RegNetX

Please refer to RegNet for details.

Res2Net

Please refer to Res2Net for details.

GRoIE

Please refer to GRoIE for details.

Dynamic R-CNN

Please refer to Dynamic R-CNN for details.

PointRend

Please refer to PointRend for details.

DetectoRS

Please refer to DetectoRS for details.

Generalized Focal Loss

Please refer to Generalized Focal Loss for details.

CornerNet

Please refer to CornerNet for details.

YOLOv3

Please refer to YOLOv3 for details.

PAA

Please refer to PAA for details.

SABL

Please refer to SABL for details.

CentripetalNet

Please refer to CentripetalNet for details.

ResNeSt

Please refer to ResNeSt for details.

DETR

Please refer to DETR for details.

Deformable DETR

Please refer to Deformable DETR for details.

AutoAssign

Please refer to AutoAssign for details.

YOLOF

Please refer to YOLOF for details.

Seesaw Loss

Please refer to Seesaw Loss for details.

CenterNet

Please refer to CenterNet for details.

YOLOX

Please refer to YOLOX for details.

PVT

Please refer to PVT for details.

SOLO

Please refer to SOLO for details.

QueryInst

Please refer to QueryInst for details.

PanopticFPN

Please refer to PanopticFPN for details.

MaskFormer

Please refer to MaskFormer for details.

DyHead

Please refer to DyHead for details.

Mask2Former

Please refer to Mask2Former for details.

EfficientNet

Please refer to EfficientNet for details.

Other datasets

We also benchmark some methods on PASCAL VOC, Cityscapes, OpenImages and WIDER FACE.

Pre-trained Models

We also train Faster R-CNN and Mask R-CNN using ResNet-50 and RegNetX-3.2G with multi-scale training and longer schedules. For convenience, these models can serve as strong pre-trained models for downstream tasks.

Speed benchmark

Training Speed benchmark

We provide analyze_logs.py to compute the average iteration time during training. You can find examples in Log Analysis.

We compare the training speed of Mask R-CNN with some other popular frameworks (the data is copied from detectron2). For mmdetection, we benchmark with mask_rcnn_r50_caffe_fpn_poly_1x_coco_v1.py, which should have the same settings as mask_rcnn_R_50_FPN_noaug_1x.yaml of detectron2. We also provide the checkpoint and training log for reference. The throughput is computed as the average throughput over iterations 100-500 to skip GPU warmup time.

Implementation	Throughput (img/s)
Detectron2	62
MMDetection	61
maskrcnn-benchmark	53
tensorpack	50
simpledet	39
Detectron	19
matterport/Mask_RCNN	14
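The throughput figures above are averaged over iterations 100-500 to skip warmup. A minimal sketch of that computation follows. The function `average_throughput`, the 0-indexed half-open window, the per-iteration batch size of 16 images, and the timing values are all assumptions made for illustration; "average throughput" is interpreted here as total images divided by total time in the window.

```python
def average_throughput(iter_times, images_per_iter, start=100, end=500):
    """Average img/s over iterations [start, end) of a training run.

    iter_times holds the wall-clock seconds of each iteration, in order.
    """
    window = iter_times[start:end]
    if not window:
        raise ValueError("no iterations fall inside the averaging window")
    # Total images processed in the window divided by total elapsed time.
    return images_per_iter * len(window) / sum(window)


# Hypothetical run: 100 slow warmup iterations followed by 500 steady ones.
times = [1.0] * 100 + [0.25] * 500
print(average_throughput(times, images_per_iter=16))  # 64.0
```

Because the slow warmup iterations fall outside the window, they do not affect the result.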

Inference Speed Benchmark

We provide benchmark.py to benchmark inference latency. The script runs the model on 2000 images and computes the average time, ignoring the first 5 runs. You can change the output log interval (default: 50) by setting LOG-INTERVAL.

python tools/benchmark.py ${CONFIG} ${CHECKPOINT} [--log-interval ${LOG-INTERVAL}] [--fuse-conv-bn]

The latency of all models in our model zoo is benchmarked without setting fuse-conv-bn; you can get a lower latency by setting it.
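The averaging scheme described above (time each run, discard the first few as warmup) can be sketched as below. This is not the code of benchmark.py: the function `benchmark`, its parameters, and the injectable `timer` are hypothetical names chosen for illustration, and `run_once` stands in for one forward pass plus post-processing with data loading excluded.

```python
import time


def benchmark(run_once, num_images=2000, num_warmup=5, timer=time.perf_counter):
    """Return the average seconds per run, ignoring the first num_warmup runs."""
    if num_images <= num_warmup:
        raise ValueError("num_images must exceed num_warmup")
    elapsed = 0.0
    for i in range(num_images):
        # On a GPU the device would need to be synchronized before reading
        # the clock (e.g. with torch.cuda.synchronize()); omitted here so the
        # sketch stays dependency-free.
        start = timer()
        run_once()
        duration = timer() - start
        if i >= num_warmup:
            elapsed += duration
    return elapsed / (num_images - num_warmup)


# Stand-in workload; a real benchmark would run the detector on one image.
latency = benchmark(lambda: sum(range(1000)), num_images=50)
print(f"average latency: {latency:.6f} s")
```

Only the timed runs enter the average, so one-off costs such as lazy initialization in the first few runs do not inflate the reported latency.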

Comparison with Detectron2

We compare mmdetection with Detectron2 in terms of speed and performance. We use commit id 185c27e (30/4/2020) of Detectron2. For fair comparison, we install and run both frameworks on the same machine.

Hardware

8 NVIDIA Tesla V100 (32G) GPUs
Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz

Software environment

Python 3.7
PyTorch 1.4
CUDA 10.1
CUDNN 7.6.03
NCCL 2.4.08

Performance

Type	Lr schd	Detectron2	mmdetection	Download
Faster R-CNN	1x	37.9	38.0	model | log
Mask R-CNN	1x	38.6 & 35.2	38.8 & 35.4	model | log
RetinaNet	1x	36.5	37.0	model | log

Training Speed

The training speed is measured in s/iter. Lower is better.

Type	Detectron2	mmdetection
Faster R-CNN	0.210	0.216
Mask R-CNN	0.261	0.265
RetinaNet	0.200	0.205

Inference Speed

The inference speed is measured in fps (img/s) on a single GPU; higher is better. To be consistent with Detectron2, we report the pure inference speed (without the time of data loading). For Mask R-CNN, we exclude the time of RLE encoding in post-processing. We also include the officially reported speed in parentheses, which is slightly higher than the results measured on our server due to hardware differences.

Type	Detectron2	mmdetection
Faster R-CNN	25.6 (26.3)	22.2
Mask R-CNN	22.5 (23.3)	19.6
RetinaNet	17.8 (18.2)	20.6

Training memory

Type	Detectron2	mmdetection
Faster R-CNN	3.0	3.8
Mask R-CNN	3.4	3.9
RetinaNet	3.9	3.4