Model Details

CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to assess how well a news thumbnail represents the news text. It is trained with counterfactual text-guided contrastive language-image pretraining.

Model Date

January 2024

Model Type

The model uses a ViT-L/14 transformer architecture as its image encoder and a causal text transformer as its text encoder. Both encoders were initialized from the openai/clip-vit-large-patch14 weights before training. The model is trained with a contrastive loss so that the similarity of positive (image, text) pairs is high and the similarity of in-batch negatives and hard negatives is low.
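The training objective described above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names, the image-to-text direction of the loss, and the choice of one hard negative text per image are assumptions made for illustration. The temperature of 0.05 matches the value reported under Relevant factors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_embeds, text_embeds, hard_negative_embeds, temperature=0.05):
    """Mean image-to-text contrastive loss (hypothetical helper).

    Assumption: text_embeds[i] is the positive for image_embeds[i], the other
    texts in the batch are in-batch negatives, and hard_negative_embeds[i] is
    one extra hard negative text for image i.
    """
    total = 0.0
    for i, image in enumerate(image_embeds):
        # Candidates: every text in the batch plus this image's hard negative
        candidates = list(text_embeds) + [hard_negative_embeds[i]]
        logits = [cosine(image, text) / temperature for text in candidates]
        # Numerically stable log-sum-exp over all candidates
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of the positive text at index i
        total += log_denominator - logits[i]
    return total / len(image_embeds)


# Toy batch: matched pairs give a much lower loss than mismatched pairs
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.0, 1.0], [1.0, 0.0]]
aligned = contrastive_loss(images, texts, hard_negatives)
misaligned = contrastive_loss(images, texts[::-1], hard_negatives)
```

A real training run would compute the same quantity on batched tensors in a deep learning framework; the loop form is used here only to make each term visible.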

Input: image and text

Output: image and text representations

Uses

Use with Transformers

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the processor and the model weights from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("humane-lab/CFT-CLIP")
model = AutoModel.from_pretrained("humane-lab/CFT-CLIP")

# Prepare one (image, text) pair
image = Image.open("cat.jpg")
inputs = processor(text=["this is a cat"], images=image, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

text_embeds = outputs.text_embeds    # text representation
image_embeds = outputs.image_embeds  # image representation
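To turn the two representations into a single score, one option is the cosine similarity between the text and image embeddings. The helper below is a minimal sketch in plain Python; `cosine_similarity` is a name chosen here, not part of the model's API. With tensors, `torch.nn.functional.cosine_similarity` computes the same quantity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Toy vectors stand in for the embeddings; with the model outputs, one could pass
# text_embeds[0].tolist() and image_embeds[0].tolist() instead.
parallel_score = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal_score = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```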

Intended Use

The model is intended as a research output for research communities.

Primary intended uses

The primary intended users of these models are AI researchers.

Out-of-Scope Use Cases

The model was not intentionally trained or evaluated in any language other than English. Therefore, use of the model should be limited to English use cases.

Factors

Relevant factors

We trained the models with the AdamW optimizer, using an initial learning rate of 1e-4 updated by a cosine annealing scheduler. The minibatch size is 128. The temperature τ in the loss equation is 0.05. Other hyperparameters were optimized by random search using a validation set. The validation loss was measured every 20 iterations, and training was stopped early when it failed to decrease five times in a row.
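The early-stopping rule can be sketched as follows. This is an illustration under assumptions, not the training code: `EarlyStopper` and `stopping_iteration` are hypothetical names, and "did not decrease" is interpreted here as "did not improve on the best validation loss seen so far".

```python
class EarlyStopper:
    """Signals a stop after `patience` consecutive checks without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            # Improvement: remember the new best and reset the counter
            self.best_loss = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


def stopping_iteration(val_losses, eval_every=20, patience=5):
    """Return the training iteration at which training would stop, or None.

    val_losses[k] is the validation loss measured at iteration (k + 1) * eval_every.
    """
    stopper = EarlyStopper(patience)
    for check_index, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return check_index * eval_every
    return None


# The best loss (0.9) is reached at the second check; five checks follow without improvement
stop_at = stopping_iteration([1.0, 0.9, 0.95, 0.93, 0.92, 0.91, 0.9, 0.8])
```

In this toy trace the stop is triggered at the seventh check, so the later value 0.8 is never reached.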

Evaluation factors

We conducted a threshold-based evaluation on NewsTT. The decision threshold was optimized on the validation set.

Metrics

Model performance measures: F1 score between model predictions and labels, and Spearman rank correlation between the model's cosine similarity and the labels.
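For reference, the Spearman rank correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration with hypothetical helper names; `scipy.stats.spearmanr` is the usual library routine for the same statistic.

```python
import math


def average_ranks(values):
    """1-based ranks, where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: cosine similarities against binary labels (values are made up)
rho = spearman([0.1, 0.4, 0.8, 0.9], [0, 0, 1, 1])
```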

Decision thresholds: A cosine-similarity threshold, selected on the validation set.
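One way to pick such a threshold is to sweep the candidate values on the validation set and keep the one with the highest F1 score. The sketch below assumes binary labels where 1 is the positive class and predicts 1 when the similarity is at or above the threshold; the helper names and the toy data are illustrative, and the selection procedure actually used may differ.

```python
def f1_score(labels, predictions):
    """Binary F1 for the positive class (label 1)."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    false_pos = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    false_neg = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def best_threshold(similarities, labels):
    """Return (threshold, f1) maximizing F1 over the observed similarity values."""
    best_t, best_f1 = None, -1.0
    for threshold in sorted(set(similarities)):
        predictions = [1 if s >= threshold else 0 for s in similarities]
        score = f1_score(labels, predictions)
        if score > best_f1:
            best_t, best_f1 = threshold, score
    return best_t, best_f1


# Toy validation set: similarities and labels are made up for illustration
threshold, validation_f1 = best_threshold([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1])
```

The selected threshold would then be applied unchanged to the test similarities.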

Approaches to uncertainty and variability: Measured by repeating the experiment with five different random seeds.

Data

Training Data

The model was trained on the publicly available BBC English Dataset, pairing each article's summary text with its thumbnail image, taken to be the image in the first paragraph. The original implementation had two variants: one using NELA-GT-2021 and the other using titles instead of summary text from the BBC Dataset.

Evaluation Data

For evaluation, 1,000 samples were randomly drawn for annotation from 10,000 NELA-GT-2021 samples that were not included in the training and validation sets. For more details, please refer to NewsTT.

Evaluation

We measured the ability of pretrained vision-language models. In addition to CLIP, we used BLIP and BLIP-2. BLIP-2+SBERT is a pipelined approach that integrates BLIP-2 with SentenceBERT.

Model	F1	Spearman
CFT-CLIP	0.815±0.003	0.491±0.005
CLIPAdapt	0.767±0.006	0.459±0.004
CLIP	0.763	0.409
BLIP	0.737	0.408
BLIP-2	0.707	0.415
BLIP-2+SBERT	0.694	0.341

Ethical Considerations

For pretraining, this study used publicly available news articles shared by news media. While we tried to build a high-quality corpus for pretraining, it is possible that the model learned hidden biases in online news. Also, since CFT-CLIP was updated from the pretrained CLIP weights, it may inherit the biases of CLIP. Users should be cautious about applying the method to problems in a general context and be aware of potential bias.