vinid committed
Commit: 895d7f9
1 Parent(s): 1a47e59

adding links to the citations

Files changed (1): readme.md (+13 -9)

readme.md CHANGED
@@ -28,7 +28,8 @@ Thus, we tried to add as much data as possible while keeping the data-quality as
 
 We considered three main sources of data:
 
- + WIT. Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
+ + [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see
+ [Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
 However, this kind of text, without more information, is not useful for learning a good mapping between images and captions.
 On the other hand, this text is written in Italian and it is of good quality.
 To prevent polluting the data with captions that are not meaningful, we used POS tagging
@@ -36,11 +37,12 @@ However, this kind of text, without more information, is not useful to learn a g
 
 Example: ....
 
- + MSCOCO-IT. This image-caption dataset comes from the work by Antonio et al., 2019. The captions comes from the original
+ + [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions come from the original
 MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
 100K images, and more than one caption is available for each image.
 
- + Conceptual Captions. This image-caption dataset comes from the work by Sharma et al., 2018. There are more than 3mln image-caption pairs in
+ + [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
+ the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). There are more than 3 million image-caption pairs in
 this dataset, and they have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
 could not retrieve them all. We then translated the captions into Italian, and were able to collect
 a dataset with 700K translated captions.
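
As a side note on the POS-tagging filter mentioned above: the text does not say which tagger or rule was used, so the snippet below is only a minimal sketch, assuming spaCy's Italian model and an illustrative rule that drops captions dominated by proper nouns and numbers (the "Roberto Baggio in 1994" kind of caption).

```python
# Minimal sketch of a POS-based caption filter. The tagger (spaCy's
# it_core_news_sm) and the proper-noun/number threshold are assumptions
# made for illustration, not the project's actual criterion.
import spacy

nlp = spacy.load("it_core_news_sm")  # install with: python -m spacy download it_core_news_sm

def keep_caption(caption: str, max_entity_ratio: float = 0.5) -> bool:
    """Drop captions that are mostly proper nouns and numbers,
    e.g. purely encyclopedic strings like 'Roberto Baggio nel 1994'."""
    tokens = [t for t in nlp(caption) if not (t.is_punct or t.is_space)]
    if not tokens:
        return False
    entity_like = sum(t.pos_ in {"PROPN", "NUM"} for t in tokens)
    return entity_like / len(tokens) <= max_entity_ratio

captions = ["Roberto Baggio nel 1994", "Un cane gioca con una palla sul prato"]
print([c for c in captions if keep_caption(c)])  # keeps only the descriptive caption
```
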
@@ -73,7 +75,7 @@ To better understand how well our clip-italian model works we run an experimenta
 
 The multilingual CLIP (henceforth, mCLIP) is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his
 [sentence-transformers](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder
- that was created through multilingual knowledge distillation (see Reimers et al., 2020).
+ that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
 
 ### Experiments Replication
 We provide two Colab notebooks to replicate both experiments.
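
Since mCLIP is the baseline we compare against, a minimal sketch of what the distillation setup provides may help: the multilingual text tower and CLIP's image tower share one embedding space, so an Italian query can rank images directly. The model names below come from the sentence-transformers documentation linked above; the image files and query are placeholders, not the code used in our experiments.

```python
# Illustrative use of mCLIP via sentence-transformers: a multilingual text
# encoder distilled into the embedding space of CLIP's image encoder, so
# Italian text can be scored against images directly.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

image_encoder = SentenceTransformer("clip-ViT-B-32")                 # CLIP image tower
text_encoder = SentenceTransformer("clip-ViT-B-32-multilingual-v1")  # distilled multilingual text tower

image_paths = ["spiaggia.jpg", "montagna.jpg", "citta.jpg"]          # placeholder files
img_embs = image_encoder.encode([Image.open(p) for p in image_paths])
query_emb = text_encoder.encode("un cane che corre sulla spiaggia")

scores = util.cos_sim(query_emb, img_embs)[0]   # one cosine score per image
print("best match:", image_paths[int(scores.argmax())])
```
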
@@ -117,7 +119,7 @@ This experiment replicates the original one run by OpenAI on zero-shot image cla
 
 Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different tasks
 we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
- paper (see, Radford et al., 2021), considering that our results are in line with those obtained by mCLIP we think that
+ paper (see [Radford et al., 2021](https://arxiv.org/abs/2103.00020)). Considering that our results are in line with those obtained by mCLIP, we think that
 the translated image labels might have had an impact on the final scores.
 
 ## Qualitative Evaluation
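
To make the zero-shot protocol concrete, the sketch below classifies an image by embedding translated class labels as Italian prompts and picking the label with the highest cosine similarity. It uses the mCLIP checkpoints named in the sentence-transformers docs linked earlier; the labels, prompt template, and image are illustrative assumptions, and it is not the exact setup behind the reported scores, although the same procedure applies to CLIP-Italian.

```python
# Hedged sketch of zero-shot image classification: embed translated class
# labels as Italian prompts, embed the image, and predict the label with the
# highest cosine similarity. Labels, prompt template, and image path are
# placeholders for illustration.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

image_encoder = SentenceTransformer("clip-ViT-B-32")
text_encoder = SentenceTransformer("clip-ViT-B-32-multilingual-v1")

labels_it = ["gatto", "cane", "aeroplano"]                    # translated class labels
prompts = [f"una foto di un {label}" for label in labels_it]  # simple Italian prompt template

img_emb = image_encoder.encode(Image.open("foto.jpg"))
txt_emb = text_encoder.encode(prompts)

scores = util.cos_sim(img_emb, txt_emb)[0]                    # one score per label
print("predicted label:", labels_it[int(scores.argmax())])
```
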
@@ -130,13 +132,15 @@ the translated image labels might have had an impact on the final scores.
 
 # References
 
- Antonio, S., Croce, D., & Basili, R. (2019). Large scale datasets for Image and Video Captioning in Italian. IJCoL. Italian Journal of Computational Linguistics, 5(5-2), 49-60.
+ Scaiella, A., Croce, D., & Basili, R. (2019). [Large scale datasets for Image and Video Captioning in Italian.](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf) IJCoL. Italian Journal of Computational Linguistics, 5(5-2), 49-60.
 
- Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2556-2565).
+ Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). [Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.](https://aclanthology.org/P18-1238.pdf) In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2556-2565).
 
- Reimers, N., & Gurevych, I. (2020, November). Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
+ Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning.](https://arxiv.org/pdf/2103.01913.pdf) arXiv preprint arXiv:2103.01913.
 
- Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
+ Reimers, N., & Gurevych, I. (2020, November). [Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation.](https://aclanthology.org/2020.emnlp-main.365/) In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
+ 
+ Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). [Learning Transferable Visual Models From Natural Language Supervision.](https://arxiv.org/abs/2103.00020) ICML.
 
 # Other Notes
 This readme has been designed using resources from Flaticon.com
 