lytang committed on
Commit a9ce7f5
1 Parent(s): e761c0d

Update README.md

Files changed (1)
  1. README.md +16 -12
README.md CHANGED
@@ -22,21 +22,20 @@ on the combination of 35K data:


 ### Model Variants
- We also have two other MiniCheck model variants:
- - [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large)
- - [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large)
+ - [bespokelabs/Bespoke-Minicheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
+ - [lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large) (Model Size: 0.8B)
+ - [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)


 ### Model Performance

 <p align="center">
- <img src="./cost-vs-bacc.png" width="360">
+ <img src="./performance_focused.png" width="550">
 </p>

 The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
- from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-DeBERTa-v3-Large outperforms all
- existing specialized fact-checkers of a similar scale by a large margin but is 2% worse than our best model MiniCheck-Flan-T5-Large, which
- is on par with GPT-4 but 400x cheaper. See full results in our work.
+ from 11 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-DeBERTa-v3-Large outperforms all
+ existing specialized fact-checkers of a similar scale. See full results in our work.

 Note: We only evaluated the performance of our models on real claims -- without any human intervention in
 any format, such as injecting certain error types into model-generated claims. Those edited claims do not reflect
@@ -52,12 +51,15 @@ Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and i

 ```python
 from minicheck.minicheck import MiniCheck
+ import os
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"
+
 doc = "A group of students gather in the school library to study for their upcoming final exams."
 claim_1 = "The students are preparing for an examination."
 claim_2 = "The students are on vacation."

- # model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
- scorer = MiniCheck(model_name='deberta-v3-large', device=f'cuda:0', cache_dir='./ckpts')
+ # model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
+ scorer = MiniCheck(model_name='deberta-v3-large', cache_dir='./ckpts')
 pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

 print(pred_label) # [1, 0]
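# Editor's note (not part of this commit): the updated comment above also lists
# 'Bespoke-MiniCheck-7B' as an accepted model_name. A minimal, unverified sketch of
# selecting that variant, assuming the same MiniCheck(...) signature shown in the diff:
scorer_7b = MiniCheck(model_name='Bespoke-MiniCheck-7B', cache_dir='./ckpts')
pred_label_7b, raw_prob_7b, _, _ = scorer_7b.score(docs=[doc, doc], claims=[claim_1, claim_2])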
@@ -70,14 +72,16 @@ print(raw_prob) # [0.9786180257797241, 0.01138285268098116]
 import pandas as pd
 from datasets import load_dataset
 from minicheck.minicheck import MiniCheck
+ import os
+ os.environ["CUDA_VISIBLE_DEVICES"] = "0"

- # load 13K test data
+ # load 29K test data
 df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
 docs = df.doc.values
 claims = df.claim.values

- scorer = MiniCheck(model_name='deberta-v3-large', device=f'cuda:0', cache_dir='./ckpts')
- pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 15 mins, depending on hardware
+ scorer = MiniCheck(model_name='deberta-v3-large', cache_dir='./ckpts')
+ pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 800 docs/min, depending on hardware
 ```

 To evaluate the result on the benchmark
 
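The diff ends before the benchmark-evaluation snippet. As a rough, unofficial sketch of that step -- assuming the LLM-AggreFact test split also exposes `dataset` and `label` columns and reusing `pred_label` from the block above -- per-dataset balanced accuracy could be computed like this:

```python
# Unofficial sketch, not the repository's evaluation script.
# Assumes the dataframe loaded in the diff above also has 'dataset' and 'label'
# columns, and reuses the pred_label list returned by scorer.score(...).
from sklearn.metrics import balanced_accuracy_score

df['pred'] = pred_label
per_dataset = df.groupby('dataset').apply(
    lambda g: balanced_accuracy_score(g['label'], g['pred'])
)
print(per_dataset)         # balanced accuracy on each source dataset
print(per_dataset.mean())  # average balanced accuracy across datasets
```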