jun-untitled committed
Commit c9fe1ed • Parent(s): c5721b4
Update README.md
README.md CHANGED
@@ -19,13 +19,13 @@ inference: false
# Vision Transformer (large-sized model)

-Vision Transformer (ViT) model pre-trained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/
+Vision Transformer (ViT) model pre-trained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) (300 million images, 21,841 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. However, since JFT-300M is a private dataset, we tried to reproduce it using the publicly available [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M) dataset.

Thanks to the Hugging Face team for converting the ViT weights trained in TensorFlow so they can be used with PyTorch, JAX/Flax and TensorFlow on Hugging Face.

## Model description

-The Vision Transformer (ViT) is a transformer model pretrained on a large collection of images in a supervised fashion, namely [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/
+The Vision Transformer (ViT) is a transformer model pretrained on a large collection of images in a supervised fashion, namely [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), at a resolution of 224x224 pixels.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is fed to the Transformer layers.
@@ -57,7 +57,7 @@ WIP
## Training data

-The ViT model was pretrained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/subset/
+The ViT model was pretrained on [COYO-Labeled-300M](https://github.com/kakaobrain/coyo-dataset/tree/main/subset/COYO-Labeled-300M), a dataset consisting of 300 million images and 21k classes.

## Training procedure
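To make the input pipeline described in the model description concrete (a 224x224 image cut into 16x16 patches, with a [CLS] token prepended), here is a minimal inference sketch using the Hugging Face Transformers library. The checkpoint id below is a placeholder, not this repository's actual name; substitute the correct repo id before running.

```python
import torch
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Placeholder checkpoint id -- replace with the actual repository name of this model.
ckpt = "kakaobrain/vit-large-patch16-224"

processor = ViTImageProcessor.from_pretrained(ckpt)
model = ViTForImageClassification.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Resize/normalize the image to 224x224 and convert it to a tensor of pixel values.
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# A 224x224 image split into 16x16 patches gives 14*14 = 196 patch tokens;
# with the prepended [CLS] token the sequence length is 197.
print(outputs.hidden_states[-1].shape)  # torch.Size([1, 197, hidden_size])

# Predicted class index for the classification head.
print(outputs.logits.argmax(-1).item())
```

The same pattern applies to the JAX/Flax and TensorFlow conversions via FlaxViTForImageClassification and TFViTForImageClassification.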