Snarci committed on
Commit 17f37cd
1 Parent(s): a3788e0

Update README.md

Files changed (1)
  1. README.md +3 -6

README.md CHANGED
@@ -13,9 +13,7 @@ widget:
 
 # Vision Transformer (base-sized model)
 
-Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
-
-Finally the ViT was finetuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set and evaluated on the official test set using the best validation model based on the loss
+Vision Transformer (ViT) model trained on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set and evaluated on the official test set with the model that achieved the best validation loss.
 
 # Augmentation pipeline
 To address the issue of class imbalance in our training set, we performed oversampling with repetition.
@@ -78,8 +76,7 @@ Currently, both the feature extractor and model support PyTorch. Tensorflow and
 
 ## Training data
 
-The ViT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset consisting of 14 million images and 21k classes, fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/), a dataset consisting of 1 million images and 1k classes.
-Finally the ViT was finetuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set
+The ViT model was fine-tuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set.
 
 ## Training procedure
 
@@ -87,7 +84,7 @@ Finally the ViT was finetuned on the [Chaoyang dataset](https://paperswithcode.c
 
 The exact details of preprocessing of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py).
 
-Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
+Images are resized/rescaled to the same resolution (384x384) during training and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
 
 # License
 This model is provided for non-commercial use only and may not be used in any research or publication without prior written consent from the author.
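
Both changed passages describe holding out a fixed 10% of the Chaoyang training set as a validation set. A hedged sketch of such a split with `torch.utils.data.random_split` and a fixed seed (the seed value and the split utility are assumptions, not taken from the original training script):

```python
import torch
from torch.utils.data import random_split

def split_train_val(train_dataset, val_fraction=0.1, seed=42):
    """Carve out a fixed validation split, reproducible across runs."""
    n_val = int(len(train_dataset) * val_fraction)
    n_train = len(train_dataset) - n_val
    return random_split(
        train_dataset,
        [n_train, n_val],
        generator=torch.Generator().manual_seed(seed),  # fixed seed -> same split every run
    )
```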
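
The augmentation section mentions oversampling with repetition to address class imbalance but does not spell out the mechanism. One common way to implement it in PyTorch is a `WeightedRandomSampler` with replacement; the helper below is a hypothetical sketch, not the author's training code:

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, batch_size=32):
    """Draw samples with replacement, weighting rare classes more heavily,
    so minority-class images are repeated within an epoch."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]  # rarer class -> larger sampling weight
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```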
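
The updated card states that images are resized to 384x384 and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). A minimal torchvision sketch of that preprocessing, for illustration only; the repository's actual input pipeline may differ:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((384, 384)),  # every image to the same resolution
    transforms.ToTensor(),          # rescale pixel values to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
```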
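
The unchanged part of the card notes that both the feature extractor and the model support PyTorch. A usage sketch via the `transformers` Auto classes follows; the model id is a placeholder, since the exact repository name is not shown in this diff:

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "<this-model-repo-id>"  # placeholder, not a confirmed repository name
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

image = Image.open("patch.png")  # a Chaoyang histopathology patch
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```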