Denoising Vision Transformer (DVT)
Introduction
We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts ("Original features" in the teaser), which hurt ViT performance in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue to the positional embeddings at the input stage. To mitigate it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, using the derived clean-feature estimates as supervision. Our method, DVT, does not require re-training existing pre-trained ViTs and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
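To give a concrete picture of the first stage, below is a minimal PyTorch sketch of the per-image decomposition idea: raw features from several crops of one image are explained as a shared, image-coordinate-dependent clean feature field plus a crop-independent, patch-position-dependent artifact term, both fit by gradient descent. This is an illustrative sketch, not the released implementation; the names (`FeatureField`, `denoise_single_image`), the small MLP standing in for the neural field, and the hyperparameters are assumptions.

```python
# Sketch of DVT stage 1 (per-image denoising), assuming:
#  - crop_feats:  list of (H, W, C) ViT feature maps from random crops of one image
#  - crop_coords: list of (H, W, 2) normalized full-image coordinates of each patch center
# A small coordinate MLP stands in for the neural field used in the paper.
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    """Coordinate MLP mapping normalized (x, y) image coordinates to a clean feature."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, coords):          # coords: (..., 2)
        return self.mlp(coords)         # (..., feat_dim)

def denoise_single_image(crop_feats, crop_coords, grid_hw=(37, 37), steps=500, lr=1e-3):
    feat_dim = crop_feats[0].shape[-1]
    field = FeatureField(feat_dim)
    # Position-dependent artifact: shared across crops, indexed by in-crop patch position.
    artifact = nn.Parameter(torch.zeros(*grid_hw, feat_dim))
    opt = torch.optim.Adam(list(field.parameters()) + [artifact], lr=lr)

    for _ in range(steps):
        loss = 0.0
        for feats, coords in zip(crop_feats, crop_coords):
            h, w, _ = feats.shape
            pred = field(coords) + artifact[:h, :w]   # clean feature + positional artifact
            loss = loss + (pred - feats).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Clean features for the full image: query the field on a dense coordinate grid.
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_hw[0]), torch.linspace(0, 1, grid_hw[1]), indexing="ij"
    )
    full_coords = torch.stack([xs, ys], dim=-1)
    with torch.no_grad():
        return field(full_coords)                     # (H, W, C) artifact-free estimate
```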
Model Summary
We include four versions of models in this space:
- `voc_denoised`: single-layer Transformer models trained to denoise the output of the original ViT models (see the illustrative sketch after this list). These models are trained on the VOC dataset.
- `voc_distilled`: models distilled from the denoiser using the ImageNet-1k dataset, where all model parameters are jointly fine-tuned. The distillation process involves three stages:
  - Stage 1: Perform per-image denoising on the VOC dataset.
  - Stage 2: Train the denoiser on the VOC dataset, using the per-image denoised features from Stage 1 as supervision.
  - Stage 3: Fine-tune the entire model on the ImageNet-1k dataset, using the outputs of the Stage 2 denoiser as supervision.
- `imgnet_denoised`: the same as `voc_denoised`, but trained on the ImageNet-1k dataset.
- `imgnet_distilled`: the same as `voc_distilled`, but with both the denoiser and the distilled model trained on the ImageNet-1k dataset.
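As a rough illustration of how a released single-layer denoiser relates to its backbone, the sketch below extracts patch tokens from a timm ViT and passes them through one standard transformer encoder layer that plays the role of the denoiser. The backbone calls (`forward_features`, `num_prefix_tokens`, `embed_dim`) follow timm's public API; the `DenoiserBlock` class, its hyperparameters, and the checkpoint-loading step are assumptions and may not match the released weights exactly.

```python
# Sketch of DVT stage 2 at inference time with a timm backbone; names and
# hyperparameters below are illustrative, not the released configuration.
import timm
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    """Single transformer encoder layer predicting clean features from raw ViT tokens."""
    def __init__(self, dim, num_heads=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )

    def forward(self, tokens):            # tokens: (B, N, C) raw patch features
        return self.layer(tokens)          # (B, N, C) denoised features

backbone = timm.create_model("vit_base_patch14_dinov2.lvd142m", pretrained=True)
backbone.eval()
denoiser = DenoiserBlock(dim=backbone.embed_dim)
# denoiser.load_state_dict(...)  # load a released voc_/imgnet_ denoiser checkpoint here

image = torch.randn(1, 3, 518, 518)        # default resolution of this DINOv2 ViT-B/14
with torch.no_grad():
    tokens = backbone.forward_features(image)               # (1, prefix + N, 768)
    patch_tokens = tokens[:, backbone.num_prefix_tokens:]   # drop cls/register tokens
    clean_tokens = denoiser(patch_tokens)                    # artifact-suppressed features
```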
Performance Summary
All tables report mIoU and mAcc for semantic segmentation on VOC2012 and ADE20K (higher is better), and RMSE, abs_rel, and a1 for monocular depth estimation on NYU-Depth-v2 (lower is better for RMSE and abs_rel, higher for a1).
- Baseline: the original ViT models.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_small_patch14_dinov2.lvd142m | 81.78 | 88.44 | 44.05 | 55.53 | 0.4340 | 0.1331 | 84.49% |
vit_base_patch14_dinov2.lvd142m | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
vit_large_patch14_dinov2.lvd142m | 83.43 | 90.38 | 47.53 | 59.64 | 0.3831 | 0.1145 | 88.89% |
vit_small_patch14_reg4_dinov2.lvd142m | 80.88 | 88.69 | 44.36 | 55.90 | 0.4328 | 0.1303 | 85.00% |
vit_base_patch14_reg4_dinov2.lvd142m | 83.48 | 90.95 | 47.73 | 60.17 | 0.3967 | 0.1177 | 87.92% |
vit_large_patch14_reg4_dinov2.lvd142m | 83.21 | 90.67 | 48.44 | 61.28 | 0.3852 | 0.1139 | 88.53% |
deit3_base_patch16_224.fb_in1k | 71.03 | 80.67 | 32.84 | 42.79 | 0.5837 | 0.1772 | 73.03% |
vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 77.75 | 86.68 | 40.50 | 52.81 | 0.5585 | 0.1678 | 74.30% |
vit_base_patch16_224.dino | 62.92 | 75.98 | 31.03 | 40.62 | 0.5742 | 0.1694 | 74.55% |
vit_base_patch16_224.mae | 50.29 | 63.10 | 23.84 | 32.06 | 0.6629 | 0.2275 | 66.24% |
eva02_base_patch16_clip_224.merged2b | 71.49 | 82.69 | 37.89 | 50.31 | - | - | - |
vit_base_patch16_384.augreg_in21k_ft_in1k | 73.51 | 83.60 | 36.46 | 48.65 | 0.6360 | 0.1898 | 69.10% |
- DVT (voc_denoised): The denoised models trained on the VOC dataset.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_small_patch14_dinov2.lvd142m | 82.78 | 90.69 | 45.14 | 56.35 | 0.4368 | 0.1337 | 84.34% |
vit_base_patch14_dinov2.lvd142m | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
vit_large_patch14_dinov2.lvd142m | 85.25 | 91.69 | 49.80 | 61.98 | 0.3826 | 0.1118 | 89.32% |
vit_small_patch14_reg4_dinov2.lvd142m | 81.93 | 89.54 | 45.55 | 57.52 | 0.4251 | 0.1292 | 85.01% |
vit_base_patch14_reg4_dinov2.lvd142m | 84.58 | 91.17 | 49.24 | 61.66 | 0.3898 | 0.1146 | 88.60% |
vit_large_patch14_reg4_dinov2.lvd142m | 84.37 | 91.42 | 49.19 | 62.21 | 0.3852 | 0.1141 | 88.45% |
deit3_base_patch16_224.fb_in1k | 73.52 | 83.65 | 33.57 | 43.56 | 0.5817 | 0.1774 | 73.05% |
vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.50 | 88.43 | 41.33 | 53.54 | 0.5512 | 0.1639 | 75.30% |
vit_base_patch16_224.dino | 66.41 | 77.75 | 32.45 | 42.42 | 0.5784 | 0.1738 | 73.75% |
vit_base_patch16_224.mae | 50.65 | 62.90 | 23.25 | 31.03 | 0.6651 | 0.2271 | 65.44% |
eva02_base_patch16_clip_224.merged2b | 73.76 | 84.50 | 37.99 | 50.40 | 0.6196 | 0.1904 | 69.86% |
vit_base_patch16_384.augreg_in21k_ft_in1k | 74.82 | 84.40 | 36.75 | 48.82 | 0.6316 | 0.1921 | 69.37% |
- DVT (voc_distilled): The distilled models trained on the VOC dataset.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_base_patch14_dinov2.lvd142m | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
vit_base_patch14_reg4_dinov2.lvd142m | 84.36 | 90.80 | 49.20 | 61.56 | 0.3838 | 0.1143 | 88.97% |
deit3_base_patch16_224.fb_in1k | 73.63 | 82.74 | 34.43 | 44.96 | 0.5712 | 0.1747 | 74.00% |
vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.86 | 88.33 | 42.28 | 54.26 | 0.5253 | 0.1571 | 77.23% |
vit_base_patch16_224.dino | 66.80 | 78.47 | 32.68 | 42.58 | 0.5750 | 0.1696 | 73.86% |
vit_base_patch16_224.mae | 51.91 | 64.67 | 23.73 | 31.88 | 0.6733 | 0.2282 | 65.33% |
eva02_base_patch16_clip_224.merged2b | 75.93 | 85.44 | 40.15 | 52.04 | - | - | - |
vit_base_patch16_384.augreg_in21k_ft_in1k | 76.26 | 85.14 | 38.62 | 50.61 | 0.5825 | 0.1768 | 73.14% |
- DVT (imgnet_denoised) and DVT (imgnet_distilled): The denoised and distilled models trained on the ImageNet-1k dataset.
Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
vit_base_patch14_dinov2.lvd142m (denoised) | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
vit_base_patch14_dinov2.lvd142m (distilled) | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
A summary of the DINOv2-base model is shown below:
vit_base_patch14_dinov2.lvd142m | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
---|---|---|---|---|---|---|---|
baseline | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
voc_denoised | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
voc_distilled | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
imgnet_denoised | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
imgnet_distilled | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
During our exploration, we found that the denoiser-training and distillation-training settings can slightly affect the performance of the final model. For example, whether to include the `cls` token in the denoiser's Transformer feed-forward layer can affect depth estimation performance (illustrated in the sketch below). Our best model during this exploration reaches around 85.56 mIoU on VOC, 49.02 mIoU on ADE20K, and 89.98% a1 on NYU. However, we do not include this model in the final release, as the added complexity brings no significant improvement.
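To make the `cls`-token design choice above concrete, here is an illustrative snippet (continuing the `DenoiserBlock` sketch above, not the released configuration) contrasting a denoiser that sees only patch tokens with one that also receives the `cls` token:

```python
# Two illustrative ways to handle the cls token in the denoiser; which tokens pass
# through the denoiser layer is the design choice discussed above.
import torch

def denoise_patches_only(denoiser, tokens, num_prefix):
    """Variant A: denoise patch tokens only; cls/register tokens bypass the denoiser."""
    prefix, patches = tokens[:, :num_prefix], tokens[:, num_prefix:]
    return torch.cat([prefix, denoiser(patches)], dim=1)

def denoise_with_cls(denoiser, tokens):
    """Variant B: feed the full token sequence, cls token included, through the denoiser."""
    return denoiser(tokens)
```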
Citation
If you find this project useful, please consider citing:
@inproceedings{yang2024denoising,
title={Denoising vision transformers},
author={Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
booktitle={ECCV},
year={2024}
}