---
license: mit
datasets:
- imirandam/TROHN-Img
---
# Model Card for CLIP_TROHN-Img

## Model Description
- **Homepage:** https://imirandam.github.io/BiVLC_project_page/
- **Repository:** https://github.com/IMirandaM/BiVLC
- **Paper:**
- **Point of Contact:** [Imanol Miranda](mailto:imanol.miranda@ehu.eus)

### Model Summary
CLIP_TROHN-Img is a model presented in the [BiVLC](https://github.com/IMirandaM/BiVLC) paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model pre-trained by 'openai'. The goal of this fine-tuning is to improve the model's compositional understanding by adding negative pairs, i.e., negative captions and negative images. Each negative differs from its positive by a small compositional change.

Hyperparameters:

* Learning rate: 1e-6.
* Scheduler: Cosine scheduler with 50 warmup steps.
* Optimizer: AdamW with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
* Loss function: InfoNCE loss.
* Batch size: The base batch size is 200; adding the negatives yields 400 images x 400 captions (200 positives + 200 hard negatives).
* Epochs: All models are fine-tuned for 10 epochs, with validation accuracy as the model selection criterion, i.e., we selected the model with the highest accuracy on the corresponding validation set.
* Data: The model is fine-tuned on the [TROHN-Img](https://huggingface.co/datasets/imirandam/TROHN-Img) dataset.

### Evaluation Data
The model is evaluated on [BiVLC](https://huggingface.co/datasets/imirandam/BiVLC).

### Licensing Information
This work is licensed under an MIT License.

## Citation Information
If you find this model useful, please consider citing our paper:
```
@inproceedings{,
  title={},
  author={},
  booktitle={},
  year={}
}
```
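
## Illustrative Sketch of the Training Setup

The hyperparameters listed in the Model Summary describe a batch that grows from 200 to 400 pairs, an InfoNCE loss, and a cosine schedule with 50 warmup steps. The following is a minimal pure-Python sketch of those three pieces, not the actual training code. It assumes a symmetric (image-to-text and text-to-image) InfoNCE loss in which each hard-negative image is paired with its own hard-negative caption on the diagonal, and a linear warmup followed by cosine decay. The function names and the `total_steps` parameter are hypothetical and not taken from the BiVLC repository.

```python
import math


def build_batch(positive_pairs, negative_pairs):
    """Append the hard-negative pairs to the positive pairs.

    Assumption: each negative image is paired with its own negative caption,
    so 200 positives + 200 negatives give a 400 x 400 similarity matrix.
    """
    return list(positive_pairs) + list(negative_pairs)


def _log_softmax(row):
    # Numerically stable log-softmax of one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def info_nce(sim):
    """Symmetric InfoNCE over a square matrix of temperature-scaled logits.

    sim[i][j] is the logit between image i and caption j; the matching
    pair for row i is assumed to sit on the diagonal (j == i).
    """
    n = len(sim)
    # Image-to-text direction: each row is a classification over captions.
    img_to_txt = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column is a classification over images.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


def cosine_lr(step, total_steps, base_lr=1e-6, warmup_steps=50):
    """Linear warmup to base_lr, then cosine decay to zero.

    total_steps is hypothetical: the model card does not state it.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr


# Toy usage: integer IDs stand in for (image, caption) pairs.
batch = build_batch(range(200), range(200, 400))
uniform_logits = [[0.0] * 4 for _ in range(4)]
loss_when_uninformative = info_nce(uniform_logits)  # equals log(4)
```

With uninformative (all-equal) logits the loss equals the log of the batch size, which is the baseline a model must beat. Enlarging the batch with hard negatives adds rows and columns that are close to the positives, so the model has to separate pairs that differ only by a small compositional change.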