fixing typos

introduction.md (+12 -13)

@@ -71,7 +71,7 @@ MSCOCO dataset and have been translated with Microsoft Translator. The 2017 vers

[Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). There are more than 3 million image-caption pairs in
this dataset that have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
could not retrieve them all. We then translated the captions to Italian. In the end, we were able to collect
a dataset with 700K translated captions.

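
To give a sense of what this collection step looks like in practice, the snippet below is a minimal, illustrative Python sketch of downloading images from a tab-separated caption/URL list with the `requests` library. The file layout, the naming scheme, and the timeout are assumptions made for the example; this is not our actual download pipeline.

```python
import csv
import requests

def download_images(tsv_path, out_dir, timeout=5):
    """Fetch images listed as tab-separated (caption, url) rows."""
    kept = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for idx, row in enumerate(csv.reader(f, delimiter="\t")):
            caption, url = row[0], row[1]
            try:
                resp = requests.get(url, timeout=timeout)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # dead or unreachable URL: drop this caption-image pair
            image_path = f"{out_dir}/{idx}.jpg"
            with open(image_path, "wb") as img_file:
                img_file.write(resp.content)
            kept.append((caption, image_path))
    return kept
```

Skipping broken links and slow hosts like this is the reason why only a subset of the original pairs can be retrieved.
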
@@ -83,14 +83,14 @@ Each photo comes along with an Italian caption.
### A Note on Translations

Instead of relying on open-source translators, we decided to use DeepL. **Translation quality** of the data was the main
reason for this choice. With the few images (compared to OpenAI) that we have, we cannot risk polluting our own data. CC is a great resource,
but the captions have to be handled accordingly. We translated 700K captions and evaluated their quality:

Three of us looked at a sample of 100 of the translations and rated them with scores from 1 to 4.
The meaning of each value is as follows: 1, the sentence has lost its meaning, or it is not possible to understand it; 2, it is possible to get the idea
but there is something wrong; 3, good, although a native speaker might complain about some of the translations; 4, good translation.

The average score was 3.78, and the three annotators had an inter-rater agreement - computed with [Gwet's AC1](https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1348/000711006X126600) using ordinal
weighting - of 0.858 (great agreement!).

| English Captions | Italian Captions |
@@ -99,7 +99,6 @@ weighting - of 0.858 (great agreement!).
| person walking down the aisle | persona che cammina lungo la navata |
| popular rides at night at the county fair | giostre popolari di notte alla fiera della contea |

We know that we annotated our own data; in the spirit of fairness, we also share the annotations and the captions so
that those interested can check the quality. The Google Sheet is [here](https://docs.google.com/spreadsheets/d/1m6TkcpJbmJlEygL7SXURIq2w8ZHuVvsmdEuCIH0VENk/edit?usp=sharing).

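
For readers who want to recompute the agreement on those annotations, the snippet below is a small Python sketch of the unweighted version of Gwet's AC1. Note that the 0.858 reported above was computed with ordinal weighting, which also gives partial credit to near-misses between adjacent scores, so this simplified version will not reproduce the exact figure; the example scores are made up.

```python
import numpy as np

def gwet_ac1(ratings, categories=(1, 2, 3, 4)):
    """Unweighted Gwet's AC1 for a (n_items, n_raters) matrix of scores."""
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    q = len(categories)

    # counts[i, k]: how many raters assigned category k to item i.
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)

    # Observed agreement: average pairwise agreement over items.
    pa = ((counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))).mean()

    # Chance agreement based on overall category prevalence (Gwet's definition).
    pi_k = counts.sum(axis=0) / (n_items * n_raters)
    pe = (pi_k * (1 - pi_k)).sum() / (q - 1)

    return (pa - pe) / (1 - pe)

# Toy example: 3 annotators scoring 5 translations on the 1-4 scale above.
toy_scores = [[4, 4, 4], [3, 4, 3], [4, 4, 4], [2, 3, 2], [4, 4, 3]]
print(round(gwet_ac1(toy_scores), 3))
```
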
@@ -113,7 +112,7 @@ While we would have liked to have augmentations for the captions as well, after

After different trials, we realized that the usual way of training this model was
not enough to get good results. We thus modified three different parts of the
training pipeline: the optimizer, the training with frozen components, and the fixed logit_scale parameter.
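
The last point deserves a concrete illustration: in the original CLIP, the logit scale (the temperature of the contrastive softmax) is a trainable parameter, while fixing it means using a constant instead. The snippet below is only an illustrative PyTorch-style sketch of that idea; the value 20.0 is a placeholder rather than the constant actually used, and this is not our training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, logit_scale=20.0):
    """CLIP-style symmetric contrastive loss with a *fixed* logit scale."""
    # Normalise so the dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; matching pairs sit on the diagonal.
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2
```
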
### Optimizer
@@ -124,9 +123,9 @@ Our implementation is available online [here](https://github.com/clip-italian/cl
### Backbone Freezing

The ViT used by OpenAI was already trained on 400 million images, and it is the element in our architecture that probably requires the least amount of training.
The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training run with the backbones of our architecture completely frozen.
Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
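
A minimal PyTorch-style sketch of this two-stage schedule is shown below. The attribute names (`vision_backbone`, `text_backbone` and the two projection heads) are placeholders chosen for the example, not the module names used in our code.

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Turn gradient updates on or off for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = flag

def freeze_backbones(model: nn.Module) -> None:
    # Pretrained encoders stay fixed while the projection layers warm up.
    set_requires_grad(model.vision_backbone, False)   # e.g. the ViT
    set_requires_grad(model.text_backbone, False)     # e.g. the BERT model
    set_requires_grad(model.visual_projection, True)  # randomly initialized
    set_requires_grad(model.text_projection, True)    # randomly initialized

def unfreeze_all(model: nn.Module) -> None:
    for param in model.parameters():
        param.requires_grad = True

# Stage 1: freeze_backbones(model), then train until the projections converge.
# Stage 2: unfreeze_all(model), then fine-tune every component of the model.
```
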
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
@@ -146,14 +145,14 @@ The following picture showcases the effect that these edits have had on our eval
The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
The yellow line is the loss with the new optimizer: it is **striking** to see how much time this addition saves us! Not only does the loss improve, it
also converges significantly faster! The blue line shows the results when
fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy, and you can see the
results in the light blue loss. Nonetheless, as is common in deep learning, having more data played a big role and was another key element
in reducing the loss.

# Scientific Validity

We split this section into two parts: we first provide a quantitative evaluation to ensure that what we are learning is in fact good.
We then show some qualitative examples of images found by the model. **All the code we have written** to run our validation experiments (in combination with
code made available by Nils Reimers and by the authors of the original CLIP) is available.

@@ -195,7 +194,7 @@ described by the original caption. As evaluation metrics we use the MRR@K.
| MRR@5  | **0.5039** | 0.3957 |
| MRR@10 | **0.5204** | 0.4129 |

_If the table above does not show, you can have a look at it [here](https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/table_imagenet.png)._

It is true that we used the training set of MSCOCO-IT in training, and this might give us an advantage. However, the original CLIP model was trained
on 400 million images (and some of them might have been from MSCOCO).
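
For reference, the sketch below shows one way to compute MRR@K for caption-to-image retrieval. It assumes that caption `i` matches image `i` (as in MSCOCO-style pairs) and that a similarity matrix has already been computed from the embeddings; it is an illustration, not our actual evaluation script.

```python
import numpy as np

def mrr_at_k(similarity, k):
    """Mean Reciprocal Rank at K; similarity[i, j] scores caption i vs image j."""
    n_queries = similarity.shape[0]
    reciprocal_ranks = []
    for i in range(n_queries):
        # Indices of the top-K images for this caption, best match first.
        top_k = np.argsort(-similarity[i])[:k]
        hit = np.where(top_k == i)[0]
        # 1/rank if the correct image is in the top K, otherwise 0.
        reciprocal_ranks.append(1.0 / (hit[0] + 1) if hit.size > 0 else 0.0)
    return float(np.mean(reciprocal_ranks))

# Example usage with embeddings produced by the model (shapes: (n, d) each):
# similarity = caption_embeddings @ image_embeddings.T
# print(mrr_at_k(similarity, k=10))
```
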
@@ -238,7 +237,7 @@ Look at the following - slightly cherry picked - examples:
Here's "a yellow flower"
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_giallo.png" alt="drawing" width="600"/>

And here's "a blue flower"
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/fiore_blu.png" alt="drawing" width="600"/>

### Counting