Update README.md
README.md CHANGED
@@ -10,13 +10,13 @@ January 2021

### Model Type

-The base model uses a
+The base model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.

### Model Version

Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.

-
+*This port does not include the ResNet model.*

Please see the paper linked below for further details about their specification.
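The added Model Type paragraph describes a contrastive image-text setup: the ViT-B/32 image encoder and the masked self-attention text encoder each produce an embedding, and matching pairs are trained toward high similarity. The sketch below shows how such a checkpoint can be queried for image-text similarity scores; it assumes the Hugging Face `transformers` CLIP classes and the `openai/clip-vit-base-patch32` checkpoint identifier, neither of which is named in this README excerpt.

```python
# Minimal sketch: scoring image-text similarity with a ported ViT-B/32 CLIP checkpoint.
# The checkpoint identifier and sample image URL are assumptions, not taken from this commit.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

# The image and text encoders each yield an embedding; logits_per_image holds
# their scaled cosine similarities, one column per candidate caption.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```

The softmax over the similarity logits mirrors the contrastive objective mentioned above: during training, the matching (image, text) pair is pushed toward high similarity relative to the other pairs in the batch.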