The original CLIP model was trained on 400 million image-text pairs; this amount of data is currently not available for Italian.
We indeed worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
To get competitive results, we followed three strategies:
1. more and better data;
2. better augmentations;
3. better training strategies.

For those interested, we have a :comet: [Comet](https://www.comet.ml/g8a9/clip-italian/reports/clip-italian-training-metrics) report
that shows a **subset** of the experiments we ran. Different hyper-parameters played a role in reducing the validation
loss. The optimizer we used gave us great performance and fast convergence, more data and augmentations helped a lot in generalizing, and
working on the training and on the loss gave us the final increase that you can see in the results.

## More and Better Data

## Better Augmentations

We knew that without a good augmentation strategy we could never get results competitive with a model trained on 400 million images. Therefore, we implemented heavy augmentations to make the training more data efficient.
They include random affine transformations and perspective changes, as well as occasional equalization and random changes to brightness, contrast, saturation and hue. We made sure to keep hue augmentations limited, however, to still give the model the ability to learn color definitions.
While we would have liked to have augmentations for the captions as well, after some experimentation we settled on randomly sampling from the five captions available in MSCOCO and leaving the rest of the captions unmodified.
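
As a concrete illustration, the snippet below sketches an augmentation pipeline of this kind with torchvision transforms, together with the random caption sampling. The transform magnitudes, probabilities and helper names are illustrative assumptions, not the exact values from our training runs.

```python
import random

from torchvision import transforms

# Heavy image augmentations in the spirit described above; all magnitudes
# and probabilities here are illustrative.
image_augmentations = transforms.Compose([
    transforms.RandomApply(
        [transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1))], p=0.8
    ),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.3),
    transforms.RandomEqualize(p=0.2),
    # Brightness/contrast/saturation jitter is relatively strong, while hue
    # jitter is kept small so the model can still learn color definitions.
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.05),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])


def sample_caption(captions: list[str]) -> str:
    """Caption-side 'augmentation': pick one of the available MSCOCO captions at random."""
    return random.choice(captions)
```
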
## Better Training
### Backbone Freezing
The ViT used by OpenAI was already trained on 400 million images and it is the element in our architecture that probably required the least training.
The same is true for the BERT model we use. To allow the randomly initialized re-projection layers to warm up without messing with the tuned weights of the backbones, we decided to do a first training with the backbones of our architecture completely frozen.
Only after these layers converged did we unfreeze the rest of the model to fine-tune all the components. This technique allowed us to reach a much better validation loss.
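
To make the two-stage schedule explicit, here is a minimal sketch in PyTorch-style code. It assumes a dual-encoder `model` that exposes `vision_model` and `text_model` submodules next to the re-projection layers; the attribute names and learning rates are assumptions for illustration, not our actual training configuration.

```python
import torch

model = ...  # assumed: a dual encoder with `vision_model`, `text_model` and projection heads


def set_backbone_trainable(model: torch.nn.Module, trainable: bool) -> None:
    # Freeze or unfreeze the two pretrained backbones; the attribute names
    # below are hypothetical and depend on how the dual encoder is defined.
    for backbone in (model.vision_model, model.text_model):
        for param in backbone.parameters():
            param.requires_grad = trainable


# Stage 1: backbones frozen, only the randomly initialized projection layers learn.
set_backbone_trainable(model, trainable=False)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4  # illustrative value
)
# ... train until the projection layers converge ...

# Stage 2: unfreeze everything and fine-tune all components with a smaller learning rate.
set_backbone_trainable(model, trainable=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative value
```
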
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="95%"/>
### Effect of Our Edits
The following picture showcases the effect that these edits have had on our evaluation loss:
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/improvements.png" alt="drawing" width="95%"/>
The purple line is the original training without any of our improvements: you can see that we needed a lot of training steps to get the loss down.
The yellow line is the loss with the new optimizer; it is **striking** to see the time we save from this addition! Not only does the loss improve, it
also converges significantly faster! The blue line shows the results when
fixed scaling is used in addition to the new optimizer. Finally, we added the backbone freezing strategy and you can see the
Currently, the model is not without limits. To mention one, its counting capabilities seem very cool, but from our experiments the model
finds it difficult to count beyond three; this is a general limitation that is common to many models of this type.

There are even more evident issues that we found in our model. Due to the unfiltered nature of our training data, the model is exposed to many biases such as sexism, racism, stereotypes,
slurs, and gore that it might replicate without awareness of their hurtful and harmful nature. Indeed, different BERT models - Italian ones included - are prone to create stereotyped
sentences that are hurtful ([Nozza et al., 2021](https://www.aclweb.org/anthology/2021.naacl-main.191.pdf)).
While this is not something we intended, it certainly is something that we share the blame for since we were not able to avoid it.