# Italian CLIP
With a few tricks, we have been able to fine-tune a competitive Italian CLIP model with **only 1.4 million** training samples. Our Italian CLIP model
is built upon the [Italian BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) model provided by [dbmdz](https://huggingface.co/dbmdz) and the OpenAI
[vision transformer](https://huggingface.co/openai/clip-vit-base-patch32).
In building this project we kept in mind the following principles:
+ **Novel Contributions**: We created a dataset of ~1.4 million Italian image-text pairs and, to the best of our knowledge, we trained the best Italian CLIP model currently in existence;
+ **Scientific Validity**: Claims are easy, facts are hard. That's why validation is important to assess the real impact of a model. We thoroughly evaluated our models on several tasks and made the validation reproducible for everybody;
+ **Broader Outlook**: We always kept in mind the possible uses of this model.
We put our **hearts** and **souls** into the project during this week! Not only did we work on a cool project, but we were
able to make new friends and learn a lot from each other while working towards a common goal!
Thank you for this amazing opportunity, we hope you will like the results. :heart:
# Novel Contributions
The original CLIP model was trained on 400 million image-text pairs; this amount of data is not available for Italian.
We therefore worked in a **low-resource setting**. The only datasets for Italian captioning in the literature are MSCOCO-IT (a translated version of MSCOCO) and WIT.
To get competitive results we followed three strategies:
1. more and better data;
2. better augmentations;
3. better training.
## More and Better Data
We had to deal with the fact that we do not have the same data that OpenAI had when training CLIP.
Thus, we tried to add as much data as possible while keeping the data quality as high as possible.
We considered three main sources of data:
+ [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see
[Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). We focused on the *Reference Description* captions described in the paper, as they are
the ones of highest quality. Nonetheless, many of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
This kind of text, without more information, is not useful for learning a good mapping between images and captions.
On the other hand, it is written in Italian and is of good quality. We cannot simply remove short captions, as some of them
are still good (e.g., "running dog"). Thus, to avoid polluting the data with captions that are not meaningful, we used *POS tagging*
on the text and removed all the captions in which 80% or more of the tokens were tagged PROPN (around 10% of the data). This simple solution allowed us to retain much
of the dataset without introducing noise.
Captions like *'Dora Riparia', 'Anna Maria Mozzoni', 'Joey Ramone Place', 'Kim Rhodes', 'Ralph George Hawtrey'* have been removed.
+ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
100K images, and more than one caption is available for each image.
+ [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). It contains more than 3 million image-caption pairs
collected from the web. We downloaded the images from the URLs provided by the dataset, but we
could not retrieve them all. We then had to translate the captions into Italian. In the end, we were able to collect
a dataset with 700K translated captions.
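The proper-noun filter applied to WIT can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the POS tags have already been produced by some tagger that uses the `PROPN` label, and the function names, the hand-written tags, and the handling of empty captions are all assumptions made for the example.

```python
from typing import Dict, List


def propn_ratio(pos_tags: List[str]) -> float:
    """Fraction of tokens tagged as proper nouns (PROPN)."""
    if not pos_tags:
        return 0.0
    return sum(tag == "PROPN" for tag in pos_tags) / len(pos_tags)


def keep_caption(pos_tags: List[str], threshold: float = 0.8) -> bool:
    """Keep a caption unless `threshold` or more of its tokens are PROPN."""
    return propn_ratio(pos_tags) < threshold


# Hypothetical tagger output, written by hand for illustration only.
examples: Dict[str, List[str]] = {
    "Anna Maria Mozzoni": ["PROPN", "PROPN", "PROPN"],
    "cane che corre": ["NOUN", "PRON", "VERB"],
    "Roberto Baggio nel 1994": ["PROPN", "PROPN", "ADP", "NUM"],
}

kept = [caption for caption, tags in examples.items() if keep_caption(tags)]
print(kept)
```

With these toy tags, the caption made only of proper nouns is dropped, while the short descriptive caption and the mixed caption are retained.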
## Better Augmentations
## Better Training
After several trials, we realized that the usual way of training this model was
not good enough to get good results. We thus modified two different parts of the
training pipeline: the optimizer and the training with frozen components.
### Optimizer
The standard AdamW did not seem sufficient to train the model, so we opted for a different optimization strategy. We eventually used AdaBelief with Adaptive Gradient Clipping (AGC) and cosine annealing.
Our implementation is available online [here](https://github.com/clip-italian/clip-italian/blob/master/hybrid_clip/run_hybrid_clip.py#L667).
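To give an intuition for two of these ingredients, here is a small pure-Python sketch of a cosine annealing schedule and of adaptive gradient clipping applied to a single parameter vector. It is not the linked implementation: the function names, the default values, and the choice to clip one flat vector at a time are assumptions made for illustration, and the AdaBelief update itself is omitted.

```python
import math
from typing import List


def cosine_annealing(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Learning rate that decays from lr_max to lr_min along half a cosine wave."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def l2_norm(values: List[float]) -> float:
    return math.sqrt(sum(v * v for v in values))


def adaptive_grad_clip(grad: List[float], weights: List[float],
                       clipping: float = 0.01, eps: float = 1e-3) -> List[float]:
    """Rescale grad so that ||grad|| / max(||weights||, eps) never exceeds `clipping`."""
    max_norm = clipping * max(l2_norm(weights), eps)
    grad_norm = l2_norm(grad)
    if grad_norm <= max_norm:
        return list(grad)  # small enough relative to the weights: leave untouched
    return [g * max_norm / grad_norm for g in grad]


# Toy usage: a large gradient is scaled down relative to the weight norm.
weights = [3.0, 4.0]   # norm 5.0
grad = [0.6, 0.8]      # norm 1.0, far above 0.01 * 5.0
clipped = adaptive_grad_clip(grad, weights)
lr_halfway = cosine_annealing(step=50, total_steps=100, lr_max=1e-3)
```

The point of clipping relative to the weight norm, rather than to a fixed constant, is that layers with small weights get proportionally smaller allowed updates.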
### Backbone Freezing
The ViT used by OpenAI was already trained on 400 million images, and it is probably the element in our architecture that required the least training.
The same is true for the BERT model we use. Thus, we decided to run a first training with the backbone of our architecture completely frozen, to allow
the deeper layers to adapt to the new setting. We then ran a second training, fine-tuning all the components. This technique allowed us to
reach a much better validation loss.
<img src="https://huggingface.co/spaces/clip-italian/clip-italian-demo/raw/main/static/img/clip-italian.png" alt="drawing" width="600"/>
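The two-stage schedule can be pictured as a mask over parameter groups that says which ones receive updates. The sketch below is only an illustration: the parameter names and the assumption that the projection layers are the parts left trainable in the first stage are hypothetical, not taken from the project's code.

```python
from typing import Dict, Iterable, List


def trainable_mask(param_names: Iterable[str], frozen_prefixes: Iterable[str]) -> Dict[str, bool]:
    """Map each parameter name to True if it should be updated, False if frozen."""
    frozen = tuple(frozen_prefixes)
    return {name: not name.startswith(frozen) for name in param_names}


# Hypothetical parameter names for a dual-encoder model.
PARAMS: List[str] = [
    "vision_model/encoder/layer_0/kernel",
    "text_model/encoder/layer_0/kernel",
    "visual_projection/kernel",
    "text_projection/kernel",
    "logit_scale",
]

# Stage one: both backbones frozen, only the remaining layers adapt.
stage_one = trainable_mask(PARAMS, frozen_prefixes=["vision_model", "text_model"])

# Stage two: nothing frozen, every component is fine-tuned.
stage_two = trainable_mask(PARAMS, frozen_prefixes=[])

print([name for name, trainable in stage_one.items() if trainable])
```

A mask of this shape can then be used to zero out or skip the gradient updates of frozen parameters in whatever training framework is in use.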
# Scientific Validity
## Quantitative Evaluation
Those images are definitely cool and interesting, but a model is nothing without validation.
To better understand how well our clip-italian model works, we ran an experimental evaluation. Since this is the first CLIP-based model in Italian, we used the multilingual CLIP model as a comparison baseline.
### mCLIP
The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his
[sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder
that was created through multilingual knowledge distillation (see [Reimers and Gurevych, 2020](https://aclanthology.org/2020.emnlp-main.365/)).
### Experiments Replication
We provide two Colab notebooks to replicate both experiments.
### Tasks
We selected two different tasks:
+ image-retrieval
+ zero-shot classification
### Image Retrieval
This experiment is run against the MSCOCO-IT validation set (which we did not use in training). Given
a caption as input, we search for the most similar image in the MSCOCO-IT validation set. As the evaluation metric
we use the mean reciprocal rank (MRR).
| MRR | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| MRR@1 | **0.3797** | 0.2874|
| MRR@5 | **0.5039** | 0.3957|
| MRR@10 | **0.5204** | 0.4129|
It is true that we used the MSCOCO-IT training data in training, and this might give us an advantage. However, the original CLIP model was trained
on 400 million images (and some of them were probably from MSCOCO).
[Colab: Image Retrieval Evaluation](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)
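For readers unfamiliar with the metric, MRR@k can be sketched as below: each caption contributes the reciprocal of the rank at which its gold image appears, or zero if the image is not within the top k results. The function names and the toy rankings are assumptions for illustration, not the notebook's code.

```python
from typing import List, Sequence


def reciprocal_rank_at_k(ranked_ids: Sequence[str], gold_id: str, k: int) -> float:
    """1 / rank of the gold image if it is in the top k results, else 0."""
    for rank, image_id in enumerate(ranked_ids[:k], start=1):
        if image_id == gold_id:
            return 1.0 / rank
    return 0.0


def mrr_at_k(rankings: List[Sequence[str]], gold_ids: Sequence[str], k: int) -> float:
    """Mean reciprocal rank over all caption queries, truncated at rank k."""
    scores = [reciprocal_rank_at_k(r, g, k) for r, g in zip(rankings, gold_ids)]
    return sum(scores) / len(scores)


# Toy retrieval results: one ranking of image ids per caption query.
rankings = [
    ["img_a", "img_b", "img_c"],  # gold image at rank 1
    ["img_c", "img_b", "img_a"],  # gold image at rank 2
    ["img_a", "img_b", "img_c"],  # gold image at rank 3
]
gold_ids = ["img_a", "img_b", "img_c"]

for k in (1, 2, 3):
    print(k, mrr_at_k(rankings, gold_ids, k))
```

Note that under this definition MRR@1 coincides with the fraction of captions whose gold image is ranked first.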
### Zero-shot image classification
This experiment replicates the original one run by OpenAI on zero-shot image classification on ImageNet. To do this, we used DeepL to
translate the ImageNet image labels into Italian. We evaluate the models by computing the accuracy.
| Accuracy | CLIP-Italian | mCLIP |
| --------------- | ------------ |-------|
| Accuracy@1 | **22.11** | 20.15 |
| Accuracy@5 | **43.69** | 36.57 |
| Accuracy@10 | **52.55** | 42.91 |
| Accuracy@100 | **81.08** | 67.11 |
[Colab: ImageNet Zero Shot Evaluation](https://colab.research.google.com/drive/1zfWeVWY79XXH63Ci-pk8xxx3Vu_RRgW-?usp=sharing)
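The evaluation idea can be sketched in a few lines: each image embedding is compared with the embeddings of all label texts, and a prediction counts as correct at k if the gold label is among the k most similar ones. The two-dimensional embeddings, the Italian labels in the comments, and the function names are made up for illustration; in the real experiment the embeddings come from the image and text encoders.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_labels(image_emb: Sequence[float], label_embs: List[Sequence[float]], k: int) -> List[int]:
    """Indices of the k labels whose text embedding is closest to the image."""
    scores = [cosine_similarity(image_emb, emb) for emb in label_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def accuracy_at_k(image_embs: List[Sequence[float]], gold_labels: Sequence[int],
                  label_embs: List[Sequence[float]], k: int) -> float:
    """Percentage of images whose gold label is among the top k predictions."""
    hits = sum(gold in top_k_labels(img, label_embs, k)
               for img, gold in zip(image_embs, gold_labels))
    return 100.0 * hits / len(gold_labels)


# Toy embeddings (hypothetical): labels could be "gatto", "cane", "cavallo".
label_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.2]]
gold_labels = [0, 1, 2]

acc_at_1 = accuracy_at_k(image_embs, gold_labels, label_embs, k=1)
acc_at_2 = accuracy_at_k(image_embs, gold_labels, label_embs, k=2)
```

In this toy setup the third image is closest to the wrong label but has its gold label in second place, so it counts as a miss at k=1 and a hit at k=2.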
Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two tasks
we tested. Note, however, that our results are lower than those shown in the original OpenAI
paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). Considering that our results are in line with those obtained by mCLIP, we think that
the translated image labels might have had an impact on the final scores.
## Qualitative Evaluation
Here we show some very interesting properties of the model. The first one is its ability to detect colors and the second one is its (partial) counting
ability. To our own surprise, many of the answers the model gives make a lot of sense!
### Colors
### Counting
# Broader Outlook
We believe that this model can be useful for many different applications, not only in research settings. Italy has many collections
of photos in digital format. For example, the [Istituto Luce Cinecittà](https://it.wikipedia.org/wiki/Istituto_Luce_Cinecitt%C3%A0) is an Italian governmental entity that has been collecting photos of Italy since the
early 1900s and is part of the largest movie studios in Europe (Cinecittà).
# References
Scaiella, A., Croce, D., & Basili, R. (2019). [Large scale datasets for Image and Video Captioning in Italian.](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf) IJCoL. Italian Journal of Computational Linguistics, 5(5-2), 49-60.
Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). [Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.](https://aclanthology.org/P18-1238.pdf) In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2556-2565).
Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning](https://arxiv.org/pdf/2103.01913.pdf). arXiv preprint arXiv:2103.01913.
Reimers, N., & Gurevych, I. (2020, November). [Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation.](https://aclanthology.org/2020.emnlp-main.365/) In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). [Learning Transferable Visual Models From Natural Language Supervision.](https://arxiv.org/abs/2103.00020) ICML.
# Other Notes
This readme has been designed using resources from Flaticon.com