---
tags:
- vision
- image-classification
datasets:
- imagenet
- imagenet-21k
- ChaoYang
widget:
  - src: >-
      https://i.ibb.co/Qr6bSRw/Adenoma.jpg
    example_title: Adenoma
  - src: >-
      https://i.ibb.co/6WBDyNp/Normal.jpg
    example_title: Normal
  - src: >-
      https://i.ibb.co/CvH8nLV/Serrated.jpg
    example_title: Serrated
---

# Vision Transformer (base-sized model) 

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. 

Finally, the ViT was fine-tuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set. The model was evaluated on the official test set using the checkpoint with the lowest validation loss.
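
The split itself is not published with this card; the snippet below is only a minimal sketch of how a fixed 10% validation split can be drawn. The stratification and the seed value are assumptions for illustration, not documented choices.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: in practice these are the Chaoyang training image paths and labels
image_paths = [f"train/img_{i}.jpg" for i in range(100)]
labels = [i % 4 for i in range(100)]  # 4 Chaoyang classes

# Fixed 10% validation split, reproducible via a fixed seed (stratification is an assumption)
train_paths, val_paths, train_labels, val_labels = train_test_split(
    image_paths, labels, test_size=0.10, stratify=labels, random_state=42
)
```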

# Augmentation pipeline
To address class imbalance in the training set, we performed oversampling with repetition.
Specifically, we duplicated images from the minority classes until all classes were evenly represented.
This resulted in a larger training set, but ensured that the model was exposed to an equal number of samples from each class during training.
We verified that this approach did not lead to overfitting or other issues by keeping the validation set at the original class distribution.
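
The oversampling code is not part of this card; the following is a minimal sketch, assuming the training data is available as a list of `(path, label)` pairs, of how minority classes can be duplicated until every class matches the largest one:

```python
import random
from collections import defaultdict

def oversample_with_repetition(samples, seed=42):
    """Duplicate minority-class samples until every class has as many
    samples as the largest class. `samples` is a list of (path, label) pairs."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))

    rng = random.Random(seed)
    target = max(len(items) for items in by_class.values())

    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        # Repeat randomly chosen samples from this class until it reaches the target size
        balanced.extend(rng.choices(items, k=target - len(items)))

    rng.shuffle(balanced)
    return balanced
```
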
We used the following [Albumentations](https://github.com/albumentations-team/albumentations) augmentation pipeline for our experiments:

- A.Resize(img_size, img_size),
- A.HorizontalFlip(p=0.5),
- A.VerticalFlip(p=0.5),
- A.RandomRotate90(p=0.5),
- A.RandomResizedCrop(img_size, img_size, scale=(0.5, 1.0), p=0.5),
- ToTensorV2(p=1.0)

This pipeline consists of the following transformations:

- Resize: resizes the image to a fixed size of (img_size, img_size).
- HorizontalFlip: flips the image horizontally with a probability of 0.5.
- VerticalFlip: flips the image vertically with a probability of 0.5.
- RandomRotate90: randomly rotates the image by 90, 180, or 270 degrees with a probability of 0.5.
- RandomResizedCrop: randomly crops a region covering between 50% and 100% of the original image area and resizes it back to (img_size, img_size), with a probability of 0.5.
- ToTensorV2: converts the image to a PyTorch tensor.

These transformations were chosen to augment the dataset with a variety of geometric transformations, while preserving important visual features.
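
For reference, a runnable sketch of the same pipeline is shown below. Here `img_size = 384` matches the fine-tuning resolution, and the positional `(height, width)` arguments to `RandomResizedCrop` assume an older Albumentations API (newer releases take a `size=(h, w)` argument instead):

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

img_size = 384  # fine-tuning resolution used for this model

train_transform = A.Compose([
    A.Resize(img_size, img_size),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    # Crop a region covering 50-100% of the image area, then resize back to img_size
    A.RandomResizedCrop(img_size, img_size, scale=(0.5, 1.0), p=0.5),
    ToTensorV2(p=1.0),
])

# Albumentations expects a NumPy array in HWC order:
# import numpy as np; from PIL import Image
# image = np.array(Image.open("example.jpg").convert("RGB"))
# tensor = train_transform(image=image)["image"]
```
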
# Results

Our model represents the current state of the art on this dataset, outperforming the previous state-of-the-art models reported on Papers with Code and in the dataset's [reference paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9600806&tag=1).
The results are summarized in the following table using macro-averaged metrics.

| Model                      | Accuracy    | F1-Score    | Precision   | Recall      |
|----------------------------|-------------|-------------|-------------|-------------|
| Baseline                   | 0.83        | 0.77        |  0.78       | 0.75        |
| ViT-384-finetuned          | 0.86  ↑3%   | 0.81  ↑4%   |  0.82  ↑4%  | 0.80  ↑5%   |
| ViT-384-from-scratch       | 0.78        | 0.74        |  0.74       | 0.74        |
| ViT-224-distilled-resnet50 | 0.74        | 0.00        |  0.00       | 0.00        |
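
All metrics above are macro-averaged, i.e. computed per class and then averaged so that each of the four classes contributes equally. A short sketch of how such metrics can be computed with scikit-learn (the label arrays are placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder ground-truth labels and predictions over the 4 Chaoyang classes
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy: {accuracy:.2f}  Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```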

### How to use

Here is how to use this model to classify an image of the Chaoyang dataset into one of the 4 classes:

```python
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'  # sample image; in practice, use a Chaoyang histopathology image
image = Image.open(requests.get(url, stream=True).raw)
feature_extractor = ViTFeatureExtractor.from_pretrained('Snarci/ViT-base-patch16-384-Chaoyang-finetuned')
model = ViTForImageClassification.from_pretrained('Snarci/ViT-base-patch16-384-Chaoyang-finetuned')
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 4 Chaoyang classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

Currently, both the feature extractor and the model support PyTorch. TensorFlow and JAX/Flax support are coming soon, and the API of `ViTFeatureExtractor` might change.

## Training data

The ViT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset consisting of 14 million images and 21k classes, and fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/), a dataset consisting of 1 million images and 1k classes.
Finally, the ViT was fine-tuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set.

## Training procedure

### Preprocessing

The exact details of preprocessing of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py). 

Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
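
A minimal sketch of this preprocessing for the fine-tuned model, assuming a plain PIL/NumPy workflow rather than the exact input pipeline linked above:

```python
import numpy as np
from PIL import Image

def preprocess(image_path, size=384):
    """Resize to (size, size) and normalize each RGB channel to [-1, 1]
    using mean 0.5 and std 0.5, as described above."""
    image = Image.open(image_path).convert("RGB").resize((size, size))
    pixels = np.asarray(image, dtype=np.float32) / 255.0  # scale to [0, 1]
    pixels = (pixels - 0.5) / 0.5                         # normalize with mean/std 0.5
    return pixels.transpose(2, 0, 1)                      # HWC -> CHW, as expected by the model
```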

# License
This model is provided for non-commercial use only and may not be used in any research or publication without prior written consent from the author.