---
language:
- en
pipeline_tag: text-classification
license: mit
---

# Model Summary

This is a fact-checking model from our work:

📃 [**MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents**](https://arxiv.org/pdf/2404.10774.pdf) ([GitHub Repo](https://github.com/Liyan06/MiniCheck))

The model is based on DeBERTa-v3-Large and predicts a binary label: 1 for supported and 0 for unsupported.
The model makes predictions at the *sentence level*. It takes a document and a sentence as input and determines
whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**

MiniCheck-DeBERTa-v3-Large is fine-tuned from `microsoft/deberta-v3-large` ([He et al., 2023](https://arxiv.org/pdf/2111.09543.pdf))
on a combination of 35K training examples:
- 21K ANLI examples ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
- 14K synthetic examples generated from scratch in a structured way (more details in the paper).

### Model Variants

- [bespokelabs/Bespoke-Minicheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
- [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)

### Model Performance

The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
built from 11 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-DeBERTa-v3-Large outperforms all
existing specialized fact-checkers of a similar scale. See the full results in our work.

Note: We evaluated our models only on real claims, that is, claims without human intervention of any form, such as
injecting certain error types into model-generated claims. Such edited claims do not reflect LLMs' actual behavior.

# Model Usage Demo

Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and install the necessary packages from `requirements.txt`.

### Below is a simple use case

```python
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
scorer = MiniCheck(model_name='deberta-v3-large', cache_dir='./ckpts')

# docs and claims are paired by position: claims[i] is checked against docs[i]
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label) # [1, 0]
print(raw_prob)   # [0.9786180257797241, 0.01138285268098116]
```

### Test on our [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact) Benchmark

```python
import pandas as pd
from datasets import load_dataset
from minicheck.minicheck import MiniCheck
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# load 29K test data
df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
docs = df.doc.values
claims = df.claim.values

scorer = MiniCheck(model_name='deberta-v3-large', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims)  # ~ 800 docs/min, depending on hardware
```

To evaluate the results on the benchmark:

```python
from sklearn.metrics import balanced_accuracy_score

# continues from the snippet above: uses `pd`, `df`, and `pred_label`
df['preds'] = pred_label
result_df = pd.DataFrame(columns=['Dataset', 'BAcc'])

# balanced accuracy (in %) for each of the benchmark's datasets
for dataset in df.dataset.unique():
    sub_df = df[df.dataset == dataset]
    bacc = balanced_accuracy_score(sub_df.label, sub_df.preds) * 100
    result_df.loc[len(result_df)] = [dataset, bacc]

# unweighted mean over the per-dataset scores
result_df.loc[len(result_df)] = ['Average', result_df.BAcc.mean()]
print(result_df.round(1))
```

# Citation

```
@misc{tang2024minicheck,
      title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents},
      author={Liyan Tang and Philippe Laban and Greg Durrett},
      year={2024},
      eprint={2404.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
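
# Appendix: Balanced Accuracy Sketch

The benchmark evaluation in the usage demo reports `BAcc`, computed with scikit-learn's `balanced_accuracy_score`. As a minimal sketch of what that metric measures, the snippet below computes the mean of per-class recall by hand. The helper `balanced_accuracy` and the toy labels are hypothetical illustrations, not part of the MiniCheck code base or the benchmark data.

```python
from collections import defaultdict


def balanced_accuracy(labels, preds):
    """Mean of per-class recall over the classes present in `labels`.

    Hypothetical helper for illustration only; the usage demo relies on
    sklearn.metrics.balanced_accuracy_score instead.
    """
    correct = defaultdict(int)  # class -> examples of that class predicted correctly
    total = defaultdict(int)    # class -> examples carrying that gold label
    for gold, pred in zip(labels, preds):
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Made-up toy data: 1 = supported, 0 = unsupported.
toy_labels = [1, 1, 1, 1, 0, 0]
toy_preds = [1, 1, 1, 0, 0, 1]

# Recall is 3/4 on class 1 and 1/2 on class 0, so the mean is 0.625.
bacc = balanced_accuracy(toy_labels, toy_preds) * 100
print(round(bacc, 1))  # 62.5
```

On this toy data plain accuracy would be 4/6 (about 66.7%), while balanced accuracy is 62.5%. Both classes contribute equally to the score regardless of how many examples each has, which matters when the supported and unsupported classes are unevenly sized.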