|
--- |
|
license: cc-by-4.0 |
|
datasets: |
|
- imagenet-1k |
|
metrics: |
|
- accuracy |
|
pipeline_tag: image-classification |
|
language: |
|
- en |
|
tags: |
|
- vision transformer |
|
- simpool |
|
- dino |
|
- computer vision |
|
- deep learning |
|
--- |
|
|
|
# Self-supervised ViT-S/16 (small-sized Vision Transformer with patch size 16) model with SimPool |
|
|
|
ViT-S/16 model with SimPool (no gamma) trained on ImageNet-1k for 100 epochs, using self-supervision with [DINO](https://arxiv.org/abs/2104.14294).
|
|
|
SimPool is a simple attention-based pooling method applied at the end of the network, introduced in this ICCV 2023 [paper](https://arxiv.org/pdf/2309.06891.pdf) and released in this [repository](https://github.com/billpsomas/simpool/).
|
Disclaimer: This model card is written by the author of SimPool, i.e. [Bill Psomas](http://users.ntua.gr/psomasbill/). |
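A minimal usage sketch in PyTorch is given below. The checkpoint filename, the `vision_transformer` module, and the `vit_small` constructor are assumptions about the SimPool repository layout (which follows DINO); check the repository for the exact entry points and checkpoint names.

```python
import torch
from PIL import Image
from torchvision import transforms

# Assumed entry point: the SimPool repository ships a DINO-style
# vision_transformer module with a vit_small constructor. Adjust to the actual code.
from vision_transformer import vit_small

model = vit_small(patch_size=16)
# Hypothetical checkpoint filename; the file is assumed to hold a plain state dict.
state_dict = torch.load("dino_vits16_simpool_no_gamma.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(img)  # pooled global representation produced by SimPool
```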
|
|
|
## Motivation |
|
|
|
Convolutional networks and vision transformers differ in their pairwise interactions, their pooling across layers, and their pooling at the end of the network. Does the latter really need to be different?
|
As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem? |
|
|
|
## Method |
|
|
|
SimPool is a simple attention-based pooling mechanism that replaces the default one in both convolutional and transformer encoders. For transformers, we completely discard the [CLS] token.
|
Interestingly, we find that, whether supervised or self-supervised, SimPool improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases. |
|
One could thus call SimPool universal. |
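To make the mechanism concrete, below is a simplified PyTorch sketch of attention-based pooling in the spirit of SimPool (no gamma): the query is the global average of the patch tokens, and a single cross-attention step produces the pooled vector. This is an illustration only; the exact implementation (projections, normalization, the optional gamma exponent) is in the repository.

```python
import torch
import torch.nn as nn

class SimpleAttentionPooling(nn.Module):
    """Illustrative attention-based pooling: GAP query + one cross-attention step."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch tokens; no [CLS] token is needed.
        q = self.q_proj(x.mean(dim=1, keepdim=True))          # (B, 1, D) query from GAP
        k = self.k_proj(x)                                     # (B, N, D) keys from patches
        attn = (q @ k.transpose(-2, -1)) / (x.shape[-1] ** 0.5)
        attn = attn.softmax(dim=-1)                            # (B, 1, N) attention over patches
        return (attn @ x).squeeze(1)                           # (B, D) pooled representation
```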
|
|
|
## Evaluation with k-NN |
|
|
|
| k   | top-1 accuracy (%) | top-5 accuracy (%) |
| --- | ------------------ | ------------------ |
| 10  | 69.778             | 85.91               |
| 20  | 69.602             | 87.54               |
| 100 | 67.318             | 88.674              |
| 200 | 65.966             | 88.404              |
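
These numbers follow the standard weighted k-NN evaluation on frozen features, as in DINO: each validation image votes among its k nearest training features, with votes weighted by similarity. A minimal sketch of such a classifier is shown below; the function and variable names are illustrative, features are assumed to be L2-normalized, and the temperature is the value commonly used in DINO.

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07, num_classes=1000):
    """Weighted k-NN on frozen, L2-normalized features (DINO-style protocol sketch)."""
    sims = test_feats @ train_feats.t()              # (num_test, num_train) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)        # k nearest training features per test image
    topk_labels = train_labels[topk_idx]             # (num_test, k) labels of the neighbors
    weights = (topk_sims / T).exp()                  # temperature-scaled similarity weights
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)      # accumulate weighted votes per class
    return votes.argmax(dim=1)                       # predicted class per test image
```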
|
|
|
## BibTeX entry and citation info |
|
|
|
```
@misc{psomas2023simpool,
      title={Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?},
      author={Bill Psomas and Ioannis Kakogeorgiou and Konstantinos Karantzalos and Yannis Avrithis},
      year={2023},
      eprint={2309.06891},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
|
|