metadata

license: mit
datasets:
  - imirandam/TROHN-Img

Model Card for CLIP_TROHN-Img

Model Description

Homepage: https://imirandam.github.io/BiVLC_project_page/
Repository: https://github.com/IMirandaM/BiVLC
Paper: https://arxiv.org/abs/2406.09952
Point of Contact: Imanol Miranda

Model Summary

CLIP_TROHN-Img is a model presented in the BiVLC paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model with the 'openai' pre-trained weights. The goal of this fine-tuning is to improve the model's compositional understanding by adding negative pairs, i.e., negative captions and negative images, which introduce small compositional changes. Hyperparameters:

Learning rate: 1e-6.
Scheduler: Cosine scheduler with 50 warmup steps.
Optimizer: AdamW optimizer with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
Loss function: InfoNCE Loss.
Batch size: We use a batch size of 200 and then add the hard negatives, which results in 400 images x 400 captions (200 positives + 200 hard negatives).
Epochs: We fine-tune all models for 10 epochs and use validation accuracy as the model selection criterion, i.e., we select the model with the highest accuracy on the corresponding validation set.
Data: The model is fine-tuned on the TROHN-Img dataset.
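
The schedule and batch layout listed above can be illustrated with a minimal, dependency-free sketch. It is not the training code used for this model: the function names, the `total_steps` parameter, and the decay-to-zero behaviour are assumptions made for illustration. It also assumes that each hard-negative image is matched with its own hard-negative caption, so that all 400 pairs lie on the diagonal of the similarity matrix, and that the logits passed to the loss are already temperature-scaled similarities.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-6, warmup_steps=50):
    """Linear warmup for `warmup_steps`, then cosine decay towards zero.

    `total_steps` is a hypothetical parameter; the card does not state it.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def build_batch(positives, hard_negatives):
    """Concatenate matched (image, caption) pairs: positives, then negatives.

    With 200 pairs of each kind this yields 400 images and 400 captions.
    """
    pairs = list(positives) + list(hard_negatives)
    images = [image for image, _ in pairs]
    captions = [caption for _, caption in pairs]
    return images, captions


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(logits):
    """Symmetric InfoNCE over a square image-by-caption similarity matrix.

    Averages the image-to-text and text-to-image cross-entropy terms.
    """
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


if __name__ == "__main__":
    # Placeholder identifiers stand in for real images and captions.
    positives = [(f"img_{i}", f"cap_{i}") for i in range(200)]
    negatives = [(f"neg_img_{i}", f"neg_cap_{i}") for i in range(200)]
    images, captions = build_batch(positives, negatives)
    print(len(images), len(captions))
    print(lr_at_step(0, total_steps=1000), lr_at_step(50, total_steps=1000))
```

In this layout the hard negatives act as extra in-batch distractors: for a given image, its hard-negative caption is one of the 399 off-diagonal columns, and the loss is lowest when the matching caption scores well above all of them.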

Evaluation Data

The model is evaluated on BiVLC.

Licensing Information

This work is licensed under the MIT License.

Citation Information

If you find this model useful, please consider citing our paper:

@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval}, 
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}