Commit acfaaf8 (parent 5fa6a85) by Silvia Terragni: Update README.md
# Italian CLIP

With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples.

In building this project we kept in mind the following principles:

+ **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
+ **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on several tasks and made the validation reproducible for everybody.
+ **Broader Outlook**: We always kept in mind the possible uses of this model.

We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were also able to make new friends and learn a lot from each other while working towards a common goal!

Thank you for this amazing opportunity, we hope you will like the results. :heart:
# Novel Contributions

The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian. We indeed worked in a **low-resource setting**. The only captioning datasets in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.

To get competitive results we followed three strategies:

1. more data;
2. better augmentations;
3. better training.
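CLIP-style fine-tuning optimizes a symmetric contrastive objective over a batch: each image embedding should match the text embedding at the same row index, with every other pair acting as a negative. As a rough illustration only (not this project's actual training code, and with an assumed temperature value), a NumPy sketch of that loss might look like:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss as in CLIP: matching image/text pairs share
    the same row index; all other pairs in the batch are negatives."""
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()  # diagonal = true pairs

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Perfectly aligned embeddings drive the loss toward zero, while mismatched pairs are penalized, which is what makes the quality of the image-text pairs (the three strategies above) so important.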
## More Data

Thus, we tried to add as much data as possible while keeping the data quality as high as possible.

We considered three main sources of data:

+ WIT. Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994). However, this kind of text, without more information, is not useful to learn a good mapping between images and captions. On the other hand, this text is written in Italian and it is of good quality. To prevent polluting the data with captions that are not meaningful, we used POS tagging on the data and removed all the captions that were composed of 80% or more proper nouns (PROPN).

  Example: ....

+ MSCOCO-IT.
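The PROPN filter described above can be sketched as follows. The function name, threshold parameter, and example captions are hypothetical; the sketch assumes each caption has already been tagged with Universal Dependencies POS labels (e.g. by spaCy's Italian pipeline):

```python
def mostly_proper_nouns(pos_tags, threshold=0.8):
    """True when >= threshold of the tokens are proper nouns (PROPN),
    i.e. the caption is likely a bare entity name and should be dropped."""
    if not pos_tags:
        return True  # an empty caption carries no learning signal
    propn = sum(1 for tag in pos_tags if tag == "PROPN")
    return propn / len(pos_tags) >= threshold

# Hypothetical captions paired with precomputed UD POS tags.
captions = [
    ("Giovanni Battista Tiepolo", ["PROPN", "PROPN", "PROPN"]),
    ("Un cane corre sulla spiaggia", ["DET", "NOUN", "VERB", "ADP", "NOUN"]),
]
kept = [text for text, tags in captions if not mostly_proper_nouns(tags)]
# kept == ["Un cane corre sulla spiaggia"]
```

The first caption is pure entity name (100% PROPN) and is filtered out; the second actually describes a scene and survives.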