Snarci committed on
Commit 17f37cd
1 Parent(s): a3788e0

Update README.md

Files changed (1)
  1. README.md +3 -6

README.md CHANGED
@@ -13,9 +13,7 @@ widget:
 
 # Vision Transformer (base-sized model)
 
-Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 384x384. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
-
-Finally the ViT was finetuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set and evaluated on the official test set using the best validation model based on the loss
+Vision Transformer (ViT) model trained on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set and evaluated on the official test set with the model that achieved the best validation loss.
 
 # Augmentation pipeline
 To address the issue of class imbalance in our training set, we performed oversampling with repetition.
@@ -78,8 +76,7 @@ Currently, both the feature extractor and model support PyTorch. Tensorflow and
 
 ## Training data
 
-The ViT model was pretrained on [ImageNet-21k](http://www.image-net.org/), a dataset consisting of 14 million images and 21k classes, fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/), a dataset consisting of 1 million images and 1k classes.
-Finally the ViT was finetuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set
+The ViT model was fine-tuned on the [Chaoyang dataset](https://paperswithcode.com/dataset/chaoyang) at resolution 384x384, using a fixed 10% of the training set as the validation set.
 
 ## Training procedure
 
@@ -87,7 +84,7 @@ Finally the ViT was finetuned on the [Chaoyang dataset](https://paperswithcode.c
 
 The exact details of preprocessing of images during training/validation can be found [here](https://github.com/google-research/vision_transformer/blob/master/vit_jax/input_pipeline.py).
 
-Images are resized/rescaled to the same resolution (224x224 during pre-training, 384x384 during fine-tuning) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
+Images are resized/rescaled to the same resolution (384x384) during training and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
 
 # License
 This model is provided for non-commercial use only and may not be used in any research or publication without prior written consent from the author.
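
Both changed passages describe holding out a fixed 10% of the Chaoyang training set as a validation set. A hedged sketch of such a split with `torch.utils.data.random_split` and a fixed seed (the seed value and the split utility are assumptions, not taken from the original training script):

```python
import torch
from torch.utils.data import random_split

def split_train_val(train_dataset, val_fraction=0.1, seed=42):
    """Carve out a fixed validation split, reproducible across runs."""
    n_val = int(len(train_dataset) * val_fraction)
    n_train = len(train_dataset) - n_val
    return random_split(
        train_dataset,
        [n_train, n_val],
        generator=torch.Generator().manual_seed(seed),  # fixed seed -> same split every run
    )
```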
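
The augmentation section mentions oversampling with repetition to address class imbalance but does not spell out the mechanism. One common way to implement it in PyTorch is a `WeightedRandomSampler` with replacement; the helper below is a hypothetical sketch, not the author's training code:

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, batch_size=32):
    """Draw samples with replacement, weighting rare classes more heavily,
    so minority-class images are repeated within an epoch."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]  # rarer class -> larger sampling weight
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```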
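
The updated card states that images are resized to 384x384 and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). A minimal torchvision sketch of that preprocessing, for illustration only; the repository's actual input pipeline may differ:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((384, 384)),  # every image to the same resolution
    transforms.ToTensor(),          # rescale pixel values to [0, 1]
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
```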
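
The unchanged part of the card notes that both the feature extractor and the model support PyTorch. A usage sketch via the `transformers` Auto classes follows; the model id is a placeholder, since the exact repository name is not shown in this diff:

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "<this-model-repo-id>"  # placeholder, not a confirmed repository name
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)

image = Image.open("patch.png")  # a Chaoyang histopathology patch
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```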