---
language: en
license: apache-2.0
datasets:
- claimbuster
metrics:
- f1
- accuracy
- precision
- recall
pipeline_tag: text-classification
---

# Fine-tuned bert-base-cased model for binary claim check-worthiness classification

The task is formulated as binary classification: determining whether a claim (text) is worth fact-checking.
This model is a fine-tuned version of the [BERT base cased model](https://huggingface.co/bert-base-cased). It was fine-tuned on the [ClaimBuster dataset](https://zenodo.org/records/3609356) (http://doi.org/10.5281/zenodo.3609356). For training, only the labels 0 and 1 were used, corresponding to the No and Yes decisions on whether a claim is considered check-worthy. After evaluation, the model was retrained on the full dataset.

# Usage

## BertForSequenceClassification

```python
from transformers import BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("yevhenkost/claimbuster-yesno-binary-bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("yevhenkost/claimbuster-yesno-binary-bert-base-cased")

text_inputs = ["The water is wet"]
model_inputs = tokenizer(text_inputs, return_tensors="pt")

# regular SequenceClassifierOutput
model_output = model(**model_inputs)

# model_output.logits: tensor([[-0.2657,  0.0749]])
```

## Pipeline

The model can also be loaded through the `transformers` `pipeline` API (a minimal sketch; the exact label names in the output depend on the model's `id2label` configuration):

```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="yevhenkost/claimbuster-yesno-binary-bert-base-cased",
)

# returns a list of dicts with "label" and "score" keys
predictions = pipe(["The water is wet"])
```

# Training Process

## Data Preparation

The files were downloaded from the ClaimBuster URL above. The dataset was prepared in the following way:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# read data
gt_df = pd.read_csv("groundtruth.csv")
cs_df = pd.read_csv("crowdsourced.csv")

# concatenate and keep only the binary labels (0 and 1)
total_df = pd.concat([cs_df, gt_df])
total_df = total_df[total_df["Verdict"].isin([0, 1])]

# split into train and test sets
train_df, test_df = train_test_split(total_df, test_size=0.2, random_state=2)
```

## Test Results

```
              precision    recall  f1-score   support

          No       0.74      0.57      0.65       485
         Yes       0.83      0.91      0.87      1139

    accuracy                           0.81      1624
   macro avg       0.79      0.74      0.76      1624
weighted avg       0.81      0.81      0.81      1624
```
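The raw logits shown in the usage section can be turned into a check-worthiness decision with a softmax. A minimal sketch, assuming index 0 maps to No and index 1 to Yes (matching the label order in the test report; this mapping is an assumption, not stated in the model config here):

```python
import torch

# Example logits copied from the usage section above.
logits = torch.tensor([[-0.2657, 0.0749]])

# Softmax over the class dimension converts logits to probabilities.
probs = torch.softmax(logits, dim=-1)

# Assumed label mapping: index 0 -> "No", index 1 -> "Yes".
labels = ["No", "Yes"]
prediction = labels[int(probs.argmax(dim=-1))]
print(prediction)
```

For these example logits the model leans toward the claim being check-worthy, with a probability of roughly 0.58 for the Yes class.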