Spaces: clip-italian/clip-italian-demo

vinid committed on Jul 18, 2021

Commit 895d7f9 • 1 Parent(s): 1a47e59

adding links to the citations

readme.md +13 -9

@@ -28,7 +28,8 @@ Thus, we tried to add as much data as possible while keeping the data-quality as
 We considered three main sources of data:
-+ WIT. Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
 However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
   On the other hand, this text is written in Italian and it is good quality.
   To prevent polluting the data with captions that are not meaningful, we used POS tagging
@@ -36,11 +37,12 @@ However, this kind of text, without more information, is not useful to learn a g
   Example: ....
-+ MSCOCO-IT. This image-caption dataset comes from the work by Antonio et al., 2019. The captions comes from the original
 MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
 100K images, for each image more than one caption is available.
-+ Conceptual Captions. This image-caption dataset comes from the work by Sharma et al., 2018. There are more than 3mln image-caption pairs in
 this dataset and these have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
 could not retrieve them all. Eventually, we had to translate the captions to Italian. We have been able to collect
 a dataset with 700K translated captions.
@@ -73,7 +75,7 @@ To better understand how well our clip-italian model works we run an experimenta
 The multilingual CLIP (henceforth, mCLIP), is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his
 [sentence-transformer](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder
-that was created through multilingual knowledge distillation (see Reimers et al., 2020).
 ### Experiments Replication
 We provide two colab notebooks to replicate both experiments.
@@ -117,7 +119,7 @@ This experiment replicates the original one run by OpenAI on zero-shot image cla
 Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different task
 we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
-paper (see, Radford et al., 2021), considering that our results are in line with those obtained by mCLIP we think that
 the translated image labels might have had an impact on the final scores.
 ## Qualitative Evaluation
@@ -130,13 +132,15 @@ the translated image labels might have had an impact on the final scores.
 # References
-Antonio, S., Croce, D., & Basili, R. (2019). Large scale datasets for Image and Video Captioning in Italian. IJCoL. Italian Journal of Computational Linguistics, 5(5-2), 49-60.
-Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2556-2565).
-Reimers, N., & Gurevych, I. (2020, November). Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
-Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.
 # Other Notes
 This readme has been designed using resources from Flaticon.com

 We considered three main sources of data:
++ [WIT](https://github.com/google-research-datasets/wit) is an image-caption dataset collected from Wikipedia (see,
+[Srinivasan et al., 2021](https://arxiv.org/pdf/2103.01913.pdf)). Most of these captions describe ontological knowledge and encyclopedic facts (e.g., Roberto Baggio in 1994).
 However, this kind of text, without more information, is not useful to learn a good mapping between images and captions.
   On the other hand, this text is written in Italian and it is good quality.
   To prevent polluting the data with captions that are not meaningful, we used POS tagging
   Example: ....
++ [MSCOCO-IT](https://github.com/crux82/mscoco-it). This image-caption dataset comes from the work by [Scaiella et al., 2019](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf). The captions comes from the original
 MSCOCO dataset and have been translated with Microsoft Translator. The 2017 version of the MSCOCO training set contains more than
 100K images, for each image more than one caption is available.
++ [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/). This image-caption dataset comes from
+the work by [Sharma et al., 2018](https://aclanthology.org/P18-1238.pdf). There are more than 3mln image-caption pairs in
 this dataset and these have been collected from the web. We downloaded the images with the URLs provided by the dataset, but we
 could not retrieve them all. Eventually, we had to translate the captions to Italian. We have been able to collect
 a dataset with 700K translated captions.
 The multilingual CLIP (henceforth, mCLIP), is a model introduced by [Nils Reimers](https://www.sbert.net/docs/pretrained_models.html) in his
 [sentence-transformer](https://www.sbert.net/index.html) library. mCLIP is based on a multilingual encoder
+that was created through multilingual knowledge distillation (see [Reimers et al., 2020](https://aclanthology.org/2020.emnlp-main.365/)).
 ### Experiments Replication
 We provide two colab notebooks to replicate both experiments.
 Our results confirm that CLIP-Italian is very competitive and beats mCLIP on the two different task
 we have been testing. Note, however, that our results are lower than those shown in the original OpenAI
+paper (see, [Radford et al., 2021](https://arxiv.org/abs/2103.00020)), considering that our results are in line with those obtained by mCLIP we think that
 the translated image labels might have had an impact on the final scores.
 ## Qualitative Evaluation
 # References
+Scaiella, A., Croce, D., & Basili, R. (2019). [Large scale datasets for Image and Video Captioning in Italian.](http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf) IJCoL. Italian Journal of Computational Linguistics, 5(5-2), 49-60.
+Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018, July). [Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.](https://aclanthology.org/P18-1238.pdf) In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 2556-2565).
+Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. (2021). [WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning](https://arxiv.org/pdf/2103.01913.pdf). arXiv preprint arXiv:2103.01913.
+Reimers, N., & Gurevych, I. (2020, November). [Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation.](https://aclanthology.org/2020.emnlp-main.365/) In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
+Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). [Learning Transferable Visual Models From Natural Language Supervision.](https://arxiv.org/abs/2103.00020) ICML.
 # Other Notes
 This readme has been designed using resources from Flaticon.com