Fine-tuned TinyBERT_General_4L_312D model for claim detection binary classification
The task is formulated as binary classification: determining whether the input text is a claim or not.
This model is a fine-tuned version of TinyBERT_General_4L_312D, fine-tuned on the ClaimBuster dataset (http://doi.org/10.5281/zenodo.3609356). For training, the original -1 label and the merged 1 and 0 labels were used, corresponding to "No" and "Yes" decisions on whether the text can be considered a claim.
After evaluation, the model was retrained on the full dataset.
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "yevhenkost/claim-detection-claimbuster-binary-TinyBERT_General_4L_312D"
)
tokenizer = AutoTokenizer.from_pretrained(
    "yevhenkost/claim-detection-claimbuster-binary-TinyBERT_General_4L_312D"
)

text_inputs = ["The water is wet"]
model_inputs = tokenizer(text_inputs, padding=True, truncation=True, return_tensors="pt")

# regular SequenceClassifierOutput; model_output.logits has shape (BATCH_SIZE, 2)
model_output = model(**model_inputs)

# map the argmax of the logits to a decision
decoding_dict = {0: "No", 1: "Yes"}
predictions = [decoding_dict[idx] for idx in model_output.logits.argmax(dim=-1).tolist()]
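Alternatively, the model can be wrapped in a text-classification pipeline. This is a minimal sketch; depending on the id2label mapping stored in the model config, the returned labels may appear as LABEL_0/LABEL_1 rather than "No"/"Yes".

from transformers import pipeline

# builds a text-classification pipeline around the same checkpoint
claim_pipeline = pipeline(
    "text-classification",
    model="yevhenkost/claim-detection-claimbuster-binary-TinyBERT_General_4L_312D",
)
print(claim_pipeline(["The water is wet"]))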
Training Process
Data Preparation
The files were downloaded from the ClaimBuster page linked above. The dataset was prepared in the following way:
import pandas as pd
from sklearn.model_selection import train_test_split
# read data
gt_df = pd.read_csv("groundtruth.csv")
cs_df = pd.read_csv("crowdsourced.csv")
# concatenate and binarize labels: -1 -> 0 ("No"); 0 and 1 -> 1 ("Yes")
total_df = pd.concat([cs_df, gt_df])
total_df["labels"] = total_df["Verdict"].apply(lambda x: 0 if x == -1 else 1)
# split on train and test
train_df, test_df = train_test_split(total_df, test_size=0.2, random_state=2)
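The fine-tuning step itself is not scripted above. The following is a minimal sketch of how the base checkpoint could be fine-tuned on the prepared splits with the transformers Trainer; the base model id huawei-noah/TinyBERT_General_4L_312D, the text column name "Text", and the hyperparameters shown are assumptions, not reported values.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# base checkpoint and num_labels=2 for the binary No/Yes task
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
model = AutoModelForSequenceClassification.from_pretrained(
    "huawei-noah/TinyBERT_General_4L_312D", num_labels=2
)

def tokenize(batch):
    # "Text" is the assumed name of the sentence column in the ClaimBuster CSVs
    return tokenizer(batch["Text"], padding="max_length", truncation=True)

train_dataset = Dataset.from_pandas(train_df[["Text", "labels"]]).map(tokenize, batched=True)
test_dataset = Dataset.from_pandas(test_df[["Text", "labels"]]).map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="claim-detection-tinybert",
    num_train_epochs=3,              # assumed value
    per_device_train_batch_size=32,  # assumed value
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()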
Test Result
              precision    recall  f1-score   support

          No       0.90      0.85      0.88      3126
         Yes       0.74      0.82      0.78      1581

    accuracy                           0.84      4707
   macro avg       0.82      0.84      0.83      4707
weighted avg       0.85      0.84      0.84      4707
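A report in this format can be produced with scikit-learn's classification_report. This sketch assumes the trainer and test_dataset objects from the fine-tuning sketch above.

from sklearn.metrics import classification_report

# predict on the held-out split and compare argmax of the logits with the true labels
test_predictions = trainer.predict(test_dataset)
pred_labels = test_predictions.predictions.argmax(axis=-1)
print(classification_report(test_predictions.label_ids, pred_labels, target_names=["No", "Yes"]))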