The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
To get competitive results, we followed three strategies:
1. more and better data;
2. better augmentations;
3. better training strategies.

For those interested, we have a :comet: [Comet](https://www.comet.ml/g8a9/clip-italian/reports/clip-italian-training-metrics) report
that shows a **subset** of the experiments we ran. Different hyper-parameters played a role in reducing the validation
loss. The optimizer we used gave us great performance and fast convergence, more data and augmentations helped a lot in generalizing, and
working on the training and on the loss gave us the final increase that you can see in the results.

## More and Better Data

## Better Augmentations

We knew that without a good augmentation strategy we could never get results competitive with a model trained on 400 million images. Therefore, we implemented heavy augmentations to make the training more data efficient.
They include random affine transformations and perspective changes, as well as occasional equalization and random changes to brightness, contrast, saturation and hue. We made sure to keep hue augmentations limited, however, to still give the model the ability to learn color definitions.
While we would have liked to have augmentations for the captions as well, after some experimentation we settled on randomly sampling from the five captions available in MSCOCO and leaving the rest of the captions unmodified.
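
As a concrete illustration, the snippet below sketches an augmentation pipeline of this kind with torchvision transforms, together with the random caption sampling. The transform magnitudes, probabilities and helper names are illustrative assumptions, not the exact values from our training runs.

```python
import random

from torchvision import transforms

# Heavy image augmentations in the spirit described above; all magnitudes
# and probabilities here are illustrative.
image_augmentations = transforms.Compose([
    transforms.RandomApply(
        [transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1))], p=0.8
    ),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.3),
    transforms.RandomEqualize(p=0.2),
    # Brightness/contrast/saturation jitter is relatively strong, while hue
    # jitter is kept small so the model can still learn color definitions.
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.05),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])


def sample_caption(captions: list[str]) -> str:
    """Caption-side 'augmentation': pick one of the available MSCOCO captions at random."""
    return random.choice(captions)
```
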
## Better Training
### Backbone Freezing
The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required the least training.
The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen.
Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
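
To make the two-stage schedule explicit, here is a minimal sketch in PyTorch-style code. It assumes a dual-encoder `model` that exposes `vision_model` and `text_model` submodules next to the re-projection layers; the attribute names and learning rates are assumptions for illustration, not our actual training configuration.

```python
import torch

model = ...  # assumed: a dual encoder with `vision_model`, `text_model` and projection heads


def set_backbone_trainable(model: torch.nn.Module, trainable: bool) -> None:
    # Freeze or unfreeze the two pretrained backbones; the attribute names
    # below are hypothetical and depend on how the dual encoder is defined.
    for backbone in (model.vision_model, model.text_model):
        for param in backbone.parameters():
            param.requires_grad = trainable


# Stage 1: backbones frozen, only the randomly initialized projection layers learn.
set_backbone_trainable(model, trainable=False)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4  # illustrative value
)
# ... train until the projection layers converge ...

# Stage 2: unfreeze everything and fine-tune all components with a smaller learning rate.
set_backbone_trainable(model, trainable=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative value
```
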
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
### Effect of Our Edits
The following picture showcases the effect that these edits have had on our evaluation loss:
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>
The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
The yellow line is the loss with the new optimizer; it is **striking** to see the time we save from this addition! Not only does the loss improve, it
also converges significantly faster! The blue line shows the results when
fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy and you can see the
Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model
finds it difficult to count beyond three; this is a general limitation that is common to many models of this type.

There are even more evident issues that we found in our model. Due to the unfiltered nature of our training data, the model is exposed to many biases such as sexism, racism, stereotypes,
slurs, and gore that it might replicate without awareness of their hurtful and harmful nature. Indeed, different BERT models - Italian ones included - are prone to create stereotyped
sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).
While this is not something we intended, it certainly is something that we share the blame for since we were not able to avoid it.