Denoising Vision Transformer (DVT)
Introduction
We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts ("Original features" in the teaser), which hurt ViT performance in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue to the positional embeddings at the input stage. To mitigate it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, using the derived clean-feature estimates as supervision. Our method, DVT, does not require re-training existing pre-trained ViTs and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
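To give a concrete picture of the first stage, below is a minimal PyTorch sketch of the per-image decomposition idea: raw features from several crops of one image are explained as a shared, image-coordinate-dependent clean feature field plus a crop-independent, patch-position-dependent artifact term, both fit by gradient descent. This is an illustrative sketch, not the released implementation; the names (`FeatureField`, `denoise_single_image`), the small MLP standing in for the neural field, and the hyperparameters are assumptions.

```python
# Sketch of DVT stage 1 (per-image denoising), assuming:
#  - crop_feats:  list of (H, W, C) ViT feature maps from random crops of one image
#  - crop_coords: list of (H, W, 2) normalized full-image coordinates of each patch center
# A small coordinate MLP stands in for the neural field used in the paper.
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    """Coordinate MLP mapping normalized (x, y) image coordinates to a clean feature."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, coords):          # coords: (..., 2)
        return self.mlp(coords)         # (..., feat_dim)

def denoise_single_image(crop_feats, crop_coords, grid_hw=(37, 37), steps=500, lr=1e-3):
    feat_dim = crop_feats[0].shape[-1]
    field = FeatureField(feat_dim)
    # Position-dependent artifact: shared across crops, indexed by in-crop patch position.
    artifact = nn.Parameter(torch.zeros(*grid_hw, feat_dim))
    opt = torch.optim.Adam(list(field.parameters()) + [artifact], lr=lr)

    for _ in range(steps):
        loss = 0.0
        for feats, coords in zip(crop_feats, crop_coords):
            h, w, _ = feats.shape
            pred = field(coords) + artifact[:h, :w]   # clean feature + positional artifact
            loss = loss + (pred - feats).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Clean features for the full image: query the field on a dense coordinate grid.
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_hw[0]), torch.linspace(0, 1, grid_hw[1]), indexing="ij"
    )
    full_coords = torch.stack([xs, ys], dim=-1)
    with torch.no_grad():
        return field(full_coords)                     # (H, W, C) artifact-free estimate
```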
Model Summary
We include four versions of models in this space:
- `voc_denoised`: single-layer Transformer models trained to denoise the output of the original ViT models (see the illustrative sketch after this list). These models are trained on the VOC dataset.
- `voc_distilled`: models distilled from the denoiser using the ImageNet-1k dataset, where all model parameters are jointly fine-tuned. The distillation process involves three stages:
  - Stage 1: Perform per-image denoising on the VOC dataset.
  - Stage 2: Train the denoiser on the VOC dataset, using the per-image denoised features from Stage 1 as supervision.
  - Stage 3: Fine-tune the entire model on the ImageNet-1k dataset, using the outputs of the Stage 2 denoiser as supervision.
- `imgnet_denoised`: the same as `voc_denoised`, but trained on the ImageNet-1k dataset.
- `imgnet_distilled`: the same as `voc_distilled`, but with both the denoiser and the distilled model trained on the ImageNet-1k dataset.
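As a rough illustration of how a released single-layer denoiser relates to its backbone, the sketch below extracts patch tokens from a timm ViT and passes them through one standard transformer encoder layer that plays the role of the denoiser. The backbone calls (`forward_features`, `num_prefix_tokens`, `embed_dim`) follow timm's public API; the `DenoiserBlock` class, its hyperparameters, and the checkpoint-loading step are assumptions and may not match the released weights exactly.

```python
# Sketch of DVT stage 2 at inference time with a timm backbone; names and
# hyperparameters below are illustrative, not the released configuration.
import timm
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    """Single transformer encoder layer predicting clean features from raw ViT tokens."""
    def __init__(self, dim, num_heads=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )

    def forward(self, tokens):            # tokens: (B, N, C) raw patch features
        return self.layer(tokens)          # (B, N, C) denoised features

backbone = timm.create_model("vit_base_patch14_dinov2.lvd142m", pretrained=True)
backbone.eval()
denoiser = DenoiserBlock(dim=backbone.embed_dim)
# denoiser.load_state_dict(...)  # load a released voc_/imgnet_ denoiser checkpoint here

image = torch.randn(1, 3, 518, 518)        # default resolution of this DINOv2 ViT-B/14
with torch.no_grad():
    tokens = backbone.forward_features(image)               # (1, prefix + N, 768)
    patch_tokens = tokens[:, backbone.num_prefix_tokens:]   # drop cls/register tokens
    clean_tokens = denoiser(patch_tokens)                    # artifact-suppressed features
```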
Performance Summary
All tables report mIoU and mAcc for semantic segmentation on VOC2012 and ADE20K (higher is better), and RMSE, abs_rel, and a1 for monocular depth estimation on NYU-Depth-v2 (lower is better for RMSE and abs_rel, higher for a1).
- Baseline: the original ViT models.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_small_patch14_dinov2.lvd142m | 81.78 | 88.44 | 44.05 | 55.53 | 0.4340 | 0.1331 | 84.49% |
vit_base_patch14_dinov2.lvd142m | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
vit_large_patch14_dinov2.lvd142m | 83.43 | 90.38 | 47.53 | 59.64 | 0.3831 | 0.1145 | 88.89% |
vit_small_patch14_reg4_dinov2.lvd142m | 80.88 | 88.69 | 44.36 | 55.90 | 0.4328 | 0.1303 | 85.00% |
vit_base_patch14_reg4_dinov2.lvd142m | 83.48 | 90.95 | 47.73 | 60.17 | 0.3967 | 0.1177 | 87.92% |
vit_large_patch14_reg4_dinov2.lvd142m | 83.21 | 90.67 | 48.44 | 61.28 | 0.3852 | 0.1139 | 88.53% |
deit3_base_patch16_224.fb_in1k | 71.03 | 80.67 | 32.84 | 42.79 | 0.5837 | 0.1772 | 73.03% |
vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 77.75 | 86.68 | 40.50 | 52.81 | 0.5585 | 0.1678 | 74.30% |
vit_base_patch16_224.dino | 62.92 | 75.98 | 31.03 | 40.62 | 0.5742 | 0.1694 | 74.55% |
vit_base_patch16_224.mae | 50.29 | 63.10 | 23.84 | 32.06 | 0.6629 | 0.2275 | 66.24% |
eva02_base_patch16_clip_224.merged2b | 71.49 | 82.69 | 37.89 | 50.31 | - | - | - |
vit_base_patch16_384.augreg_in21k_ft_in1k | 73.51 | 83.60 | 36.46 | 48.65 | 0.6360 | 0.1898 | 69.10% |
- DVT (voc_denoised): The denoised models trained on the VOC dataset.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_small_patch14_dinov2.lvd142m | 82.78 | 90.69 | 45.14 | 56.35 | 0.4368 | 0.1337 | 84.34% |
vit_base_patch14_dinov2.lvd142m | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
vit_large_patch14_dinov2.lvd142m | 85.25 | 91.69 | 49.80 | 61.98 | 0.3826 | 0.1118 | 89.32% |
vit_small_patch14_reg4_dinov2.lvd142m | 81.93 | 89.54 | 45.55 | 57.52 | 0.4251 | 0.1292 | 85.01% |
vit_base_patch14_reg4_dinov2.lvd142m | 84.58 | 91.17 | 49.24 | 61.66 | 0.3898 | 0.1146 | 88.60% |
vit_large_patch14_reg4_dinov2.lvd142m | 84.37 | 91.42 | 49.19 | 62.21 | 0.3852 | 0.1141 | 88.45% |
deit3_base_patch16_224.fb_in1k | 73.52 | 83.65 | 33.57 | 43.56 | 0.5817 | 0.1774 | 73.05% |
vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.50 | 88.43 | 41.33 | 53.54 | 0.5512 | 0.1639 | 75.30% |
vit_base_patch16_224.dino | 66.41 | 77.75 | 32.45 | 42.42 | 0.5784 | 0.1738 | 73.75% |
vit_base_patch16_224.mae | 50.65 | 62.90 | 23.25 | 31.03 | 0.6651 | 0.2271 | 65.44% |
eva02_base_patch16_clip_224.merged2b | 73.76 | 84.50 | 37.99 | 50.40 | 0.6196 | 0.1904 | 69.86% |
vit_base_patch16_384.augreg_in21k_ft_in1k | 74.82 | 84.40 | 36.75 | 48.82 | 0.6316 | 0.1921 | 69.37% |
- DVT (voc_distilled): The distilled models trained on the VOC dataset.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_base_patch14_dinov2.lvd142m | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
vit_base_patch14_reg4_dinov2.lvd142m | 84.36 | 90.80 | 49.20 | 61.56 | 0.3838 | 0.1143 | 88.97% |
deit3_base_patch16_224.fb_in1k | 73.63 | 82.74 | 34.43 | 44.96 | 0.5712 | 0.1747 | 74.00% |
vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.86 | 88.33 | 42.28 | 54.26 | 0.5253 | 0.1571 | 77.23% |
vit_base_patch16_224.dino | 66.80 | 78.47 | 32.68 | 42.58 | 0.5750 | 0.1696 | 73.86% |
vit_base_patch16_224.mae | 51.91 | 64.67 | 23.73 | 31.88 | 0.6733 | 0.2282 | 65.33% |
eva02_base_patch16_clip_224.merged2b | 75.93 | 85.44 | 40.15 | 52.04 | - | - | - |
vit_base_patch16_384.augreg_in21k_ft_in1k | 76.26 | 85.14 | 38.62 | 50.61 | 0.5825 | 0.1768 | 73.14% |
- DVT (imgnet_denoised) and DVT (imgnet_distilled): The denoised and distilled models trained on the ImageNet-1k dataset.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_base_patch14_dinov2.lvd142m (denoised) | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
vit_base_patch14_dinov2.lvd142m (distilled) | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
A summary of the DINOv2-base model is shown below:
vit_base_patch14_dinov2.lvd142m | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
baseline | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
voc_denoised | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
voc_distilled | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
imgnet_denoised | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
imgnet_distilled | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
During our exploration, we found that the denoiser-training and distillation-training settings can slightly affect the performance of the final model. For example, whether to include the `cls` token in the denoiser's Transformer feed-forward layer can affect depth estimation performance (illustrated in the sketch below). Our best model during this exploration reaches around 85.56 mIoU on VOC, 49.02 mIoU on ADE20K, and 89.98% a1 on NYU. However, we do not include this model in the final release, as the added complexity brings no significant improvement.
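To make the `cls`-token design choice above concrete, here is an illustrative snippet (continuing the `DenoiserBlock` sketch above, not the released configuration) contrasting a denoiser that sees only patch tokens with one that also receives the `cls` token:

```python
# Two illustrative ways to handle the cls token in the denoiser; which tokens pass
# through the denoiser layer is the design choice discussed above.
import torch

def denoise_patches_only(denoiser, tokens, num_prefix):
    """Variant A: denoise patch tokens only; cls/register tokens bypass the denoiser."""
    prefix, patches = tokens[:, :num_prefix], tokens[:, num_prefix:]
    return torch.cat([prefix, denoiser(patches)], dim=1)

def denoise_with_cls(denoiser, tokens):
    """Variant B: feed the full token sequence, cls token included, through the denoiser."""
    return denoiser(tokens)
```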
Citation
If you find this project useful, please consider citing:
@inproceedings{yang2024denoising,
title={Denoising vision transformers},
author={Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
booktitle={ECCV},
year={2024}
}