---
license: mit
datasets:
- imirandam/TROHN-Img
---


# Model Card for CLIP_TROHN-Img

## Model Description
- **Homepage:** https://imirandam.github.io/BiVLC_project_page/
- **Repository:** https://github.com/IMirandaM/BiVLC
- **Paper:** https://arxiv.org/abs/2406.09952
- **Point of Contact:** [Imanol Miranda](mailto:[email protected])

### Model Summary

CLIP_TROHN-Img is a model presented in the [BiVLC](https://github.com/IMirandaM/BiVLC) paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model with the `openai` pre-trained weights. The fine-tuning aims to improve the model's compositional understanding by adding negative pairs, i.e., negative captions and negative images, which differ from the positives by small compositional changes. Hyperparameters:

* Learning rate: 1e-6.
* Scheduler: Cosine scheduler with 50 warmup steps.
* Optimizer: AdamW optimizer with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
* Loss function: InfoNCE Loss.
* Batch size: Each batch contains 200 positive pairs, to which the hard negatives are added, resulting in 400 images x 400 captions (200 positives + 200 hard negatives).
* Epochs: All models are fine-tuned for 10 epochs, with validation accuracy as the model selection criterion, i.e., the checkpoint with the highest accuracy on the corresponding validation set is selected.
* Data: The model is fine-tuned on the [TROHN-Img](https://huggingface.co/datasets/imirandam/TROHN-Img) dataset.
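
To make the batch layout and loss more concrete, here is a minimal, dependency-free sketch of a symmetric InfoNCE loss over a batch in which hard negatives are appended as extra image-caption pairs. It is an illustration, not the training code: the toy 3-dimensional embeddings, the helper names (`similarity_matrix`, `info_nce`), the fixed temperature of 0.07, and the assumption that each negative image is treated as the match of its negative caption are all choices made for this example. With 200 positives and 200 hard negatives the same code would operate on a 400 x 400 matrix.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs):
    """sim[i][j] = cosine similarity between image i and caption j."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE: matched pairs sit on the diagonal of `sim`.

    The temperature is an illustrative constant (an assumption of this sketch).
    """
    n = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    # Image-to-text direction: each row is scored against all captions.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column is scored against all images.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy batch: 2 positive pairs followed by their 2 hard-negative pairs.
pos_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pos_captions = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
neg_images = [[0.8, 0.0, 0.6], [0.0, 0.8, 0.6]]
neg_captions = [[0.7, 0.1, 0.6], [0.1, 0.7, 0.6]]

images = pos_images + neg_images        # 4 images
captions = pos_captions + neg_captions  # 4 captions
sim = similarity_matrix(images, captions)  # 4 x 4 similarity matrix
loss = info_nce(sim)
```

Because each hard negative differs from its positive only slightly, its similarity to the positive's caption stays high, which is what makes these in-batch negatives harder than random ones.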

### Evaluation Data
The model is evaluated on [BiVLC](https://huggingface.co/datasets/imirandam/BiVLC).
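
As a rough illustration of what a bidirectional retrieval evaluation can look like, the sketch below scores a single instance made of a positive and a negative image-caption pair. The scoring rules and the names `score_instance` and `aggregate` are assumptions made for this example; the BiVLC paper and repository define the actual protocol and metrics.

```python
def score_instance(sim):
    """Score one instance from a 2 x 2 similarity matrix.

    Assumed layout: rows = [positive image, negative image],
    columns = [positive caption, negative caption].
    """
    # Image-to-text: each image must prefer its own caption.
    i2t = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Text-to-image: each caption must prefer its own image.
    t2i = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    return {"i2t": i2t, "t2i": t2i, "group": i2t and t2i}

def aggregate(results):
    """Fraction of instances solved for each score."""
    n = len(results)
    return {key: sum(r[key] for r in results) / n for key in ("i2t", "t2i", "group")}

# Hypothetical similarity scores for two instances.
results = [
    score_instance([[0.9, 0.2], [0.3, 0.8]]),     # both directions correct
    score_instance([[0.9, 0.5], [0.95, 0.97]]),   # text-to-image fails
]
summary = aggregate(results)
```

In practice the 2 x 2 similarities would come from the model's image and text embeddings rather than hand-written numbers.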

### Licensing Information
This work is licensed under the MIT License.

## Citation Information
If you find this model useful, please consider citing our paper:
```
@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval}, 
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```