DCNv4

News

Jan 15, 2024: 🚀 Compared with InternImage, the new FlashInternImage, powered by DCNv4, offers faster inference, faster convergence, and better performance!
Jan 15, 2024: 🚀 DCNv4 is released!

Introduction

We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: (1) removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power, and (2) optimizing memory access to minimize redundant operations for speedup. These improvements yield significantly faster convergence than DCNv3 and a substantial increase in processing speed, with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and, notably, image generation. When integrated into generative models such as the U-Net in a latent diffusion model, DCNv4 outperforms its baseline, underscoring its potential to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage yields up to an 80% speed increase and improved performance with no other modifications. The advances in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.
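To make the first enhancement concrete, here is a minimal, illustrative sketch of spatial aggregation at a single output location, with and without softmax normalization. It is not the DCNv4 implementation: the real operator works on feature tensors with learned sampling offsets inside a custom kernel. The function names, the sample values, and the scores below are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_normalized(values, scores):
    """Softmax-normalized aggregation (DCNv3-style sketch).

    The weights are positive and sum to 1, so the result is a convex
    combination that always stays within [min(values), max(values)].
    """
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


def aggregate_unbounded(values, scores):
    """Aggregation without normalization (DCNv4-style sketch).

    The raw scores are used directly as weights, so they may be
    negative or larger than 1 and the result is not range-limited.
    """
    return sum(s * v for s, v in zip(scores, values))


# Hypothetical features sampled at K = 3 locations for one output pixel.
sampled_values = [1.0, 2.0, 4.0]
# Hypothetical per-location scores predicted by the network.
raw_scores = [2.0, -1.0, 1.5]

bounded = aggregate_normalized(sampled_values, raw_scores)
unbounded = aggregate_unbounded(sampled_values, raw_scores)

print(f"softmax-normalized: {bounded:.4f}")
print(f"unnormalized:       {unbounded:.4f}")
```

With softmax, the output can never leave the range spanned by the sampled values. Without it, the same scores give 6.0, which lies outside the sampled range of 1.0 to 4.0. This toy case is one way to read the claim that dropping normalization increases the operator's expressive power.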

Released Models

ImageNet Image Classification

name	pretrain	resolution	acc@1	#param	download
FlashInternImage-T	ImageNet-1K	224x224	83.6	30M	ckpt | cfg
FlashInternImage-S	ImageNet-1K	224x224	84.4	50M	ckpt | cfg
FlashInternImage-B	ImageNet-1K	224x224	84.9	97M	ckpt | cfg
FlashInternImage-L	ImageNet-22K	384x384	88.1	223M	ckpt | cfg

COCO Object Detection and Instance Segmentation

backbone	method	schd	box mAP	mask mAP	Config	Download
FlashInternImage-T	Mask R-CNN	1x	48.0	43.1	config	ckpt | log
FlashInternImage-T	Mask R-CNN	3x	49.5	44.0	config	ckpt | log
FlashInternImage-S	Mask R-CNN	1x	49.2	44.0	config	ckpt | log
FlashInternImage-S	Mask R-CNN	3x	50.5	44.9	config	ckpt | log
FlashInternImage-B	Mask R-CNN	1x	50.1	44.5	config	ckpt | log
FlashInternImage-B	Mask R-CNN	3x	50.6	45.4	config	ckpt | log

backbone	method	schd	box mAP	mask mAP	Config	Download
FlashInternImage-L	Cascade Mask R-CNN	1x	55.6	48.2	config	ckpt | log
FlashInternImage-L	Cascade Mask R-CNN	3x	56.7	48.9	config	ckpt

backbone	method	lr type	pretrain	schd	box mAP	Config	Download
FlashInternImage-T	DINO	layer-wise lr	ImageNet-1K	1x	54.7	config	ckpt | log
FlashInternImage-S	DINO	layer-wise lr	ImageNet-1K	1x	55.3	config	ckpt | log
FlashInternImage-B	DINO	layer-wise lr	ImageNet-1K	1x	56.0	config	ckpt | log
FlashInternImage-L	DINO	0.1x backbone lr	ImageNet-22K	1x	58.8	config	ckpt | log

ADE20K Semantic Segmentation

backbone	method	resolution	mIoU (ss/ms)	Config	Download
FlashInternImage-T	UperNet	512x512	49.3 / 50.3	config	ckpt | log
FlashInternImage-S	UperNet	512x512	50.6 / 51.6	config	ckpt | log
FlashInternImage-B	UperNet	512x512	52.0 / 52.6	config	ckpt | log
FlashInternImage-L	UperNet	640x640	55.6 / 56.0	config	ckpt | log

backbone	method	resolution	mIoU (ss)	Config	Download
FlashInternImage-T	Mask2Former	512x512	51.2	config	ckpt | log
FlashInternImage-S	Mask2Former	640x640	52.6	config	ckpt | log
FlashInternImage-B	Mask2Former	640x640	53.4	config	ckpt | log
FlashInternImage-L	Mask2Former	640x640	56.7	config	ckpt | log

Citations

If this work is helpful for your research, please consider citing the following BibTeX entries.


@article{xiong2024efficient,
  title={Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications},
  author={Yuwen Xiong and Zhiqi Li and Yuntao Chen and Feng Wang and Xizhou Zhu and Jiapeng Luo and Wenhai Wang and Tong Lu and Hongsheng Li and Yu Qiao and Lewei Lu and Jie Zhou and Jifeng Dai},
  journal={arXiv preprint arXiv:2401.06197},
  year={2024}
}

@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}

@inproceedings{zhu2022uni,
  title={Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks},
  author={Zhu, Xizhou and Zhu, Jinguo and Li, Hao and Wu, Xiaoshi and Li, Hongsheng and Wang, Xiaohua and Dai, Jifeng},
  booktitle={CVPR},
  pages={16804--16815},
  year={2022}
}

@article{zhu2022unimoe,
  title={Uni-perceiver-moe: Learning sparse generalist models with conditional moes},
  author={Zhu, Jinguo and Zhu, Xizhou and Wang, Wenhai and Wang, Xiaohua and Li, Hongsheng and Wang, Xiaogang and Dai, Jifeng},
  journal={arXiv preprint arXiv:2206.04674},
  year={2022}
}

@article{li2022uni,
  title={Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks},
  author={Li, Hao and Zhu, Jinguo and Jiang, Xiaohu and Zhu, Xizhou and Li, Hongsheng and Yuan, Chun and Wang, Xiaohua and Qiao, Yu and Wang, Xiaogang and Wang, Wenhai and others},
  journal={arXiv preprint arXiv:2211.09808},
  year={2022}
}

@article{yang2022bevformer,
  title={BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision},
  author={Yang, Chenyu and Chen, Yuntao and Tian, Hao and Tao, Chenxin and Zhu, Xizhou and Zhang, Zhaoxiang and Huang, Gao and Li, Hongyang and Qiao, Yu and Lu, Lewei and others},
  journal={arXiv preprint arXiv:2211.10439},
  year={2022}
}

@article{su2022towards,
  title={Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information},
  author={Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng},
  journal={arXiv preprint arXiv:2211.09807},
  year={2022}
}

@inproceedings{li2022bevformer,
  title={Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers},
  author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  booktitle={ECCV},
  pages={1--18},
  year={2022}
}