Zongxia commited on
Commit
5116a09
•
1 Parent(s): 5c0475a

Upload README.md

Files changed (1)
  1. README.md +263 -3
README.md CHANGED
@@ -1,3 +1,263 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ inference: false
3
+ license: mit
4
+ language:
5
+ - en
6
+ metrics:
7
+ - exact_match
8
+ - f1
9
+ - bertscore
10
+ pipeline_tag: text-classification
11
+ ---
12
+ # QA-Evaluation-Metrics
13
+
14
+ [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
+ [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
+
17
+ QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models. It provides various basic and efficient metrics to assess the performance of QA models.
18
+
19
+ ### Updates
20
+ - Updated to version 0.2.8
21
+ - Now supports prompting OpenAI GPT-series models and Anthropic Claude-series models (assuming `openai` version >= 1.0).
22
+ - Supports prompting various open-source models, such as LLaMA-2-70B-chat and LLaVA-1.5, by calling the API from [deepinfra](https://deepinfra.com/models).
23
+
24
+
25
+ ## Installation
26
+ * Python version >= 3.6
27
+ * openai version >= 1.0
28
+
29
+
30
+ To install the package, run the following command:
31
+
32
+ ```bash
33
+ pip install qa-metrics
34
+ ```
35
+
36
+ ## Usage/Logistics
37
+
38
+ The Python package currently provides six QA evaluation methods.
39
+ - Given a set of gold answers, a candidate answer to be evaluated, and a question (if applicable), the evaluation returns True if the candidate answer matches any one of the gold answers, and False otherwise.
40
+ - The evaluation methods differ in how strictly they judge the correctness of a candidate answer. Some correlate better with human judgments than others.
41
+ - Normalized Exact Match and Question/Answer Type Evaluation are the most efficient methods. They are suitable for short-form QA datasets such as NQ-OPEN, HotpotQA, TriviaQA, SQuAD, etc.
42
+ - Question/Answer Type Evaluation and Transformer Neural Evaluation are cost-free and suitable for both short-form and longer-form QA datasets. They correlate better with human judgments than exact match and F1 score when the gold and candidate answers are long.
43
+ - Black-box LLM evaluations are the closest to human evaluations, but they are not cost-free.
44
+
45
+ ## Normalized Exact Match
46
+ #### `em_match`
47
+
48
+ Returns a boolean indicating whether there are any exact normalized matches between gold and candidate answers.
49
+
50
+ **Parameters**
51
+
52
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
53
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
54
+
55
+ **Returns**
56
+
57
+ - `boolean`: A boolean (True/False) indicating whether the candidate answer matches any of the reference answers.
58
+
59
+ ```python
60
+ from qa_metrics.em import em_match
61
+
62
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
63
+ candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
64
+ match_result = em_match(reference_answer, candidate_answer)
65
+ print("Exact Match: ", match_result)
66
+ '''
67
+ Exact Match: False
68
+ '''
69
+ ```
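
The idea behind normalized exact match can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the normalization steps (lowercasing, stripping punctuation, dropping English articles, collapsing whitespace) follow the common SQuAD-style convention and are an assumption here, and the names `normalize_answer` and `sketch_em_match` are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; qa-metrics may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sketch_em_match(reference_answers, candidate_answer):
    """Return True if the normalized candidate equals any normalized reference."""
    normalized_candidate = normalize_answer(candidate_answer)
    return any(
        normalize_answer(reference) == normalized_candidate
        for reference in reference_answers
    )


references = ["The Frog Prince", "The Princess and the Frog"]
print(sketch_em_match(references, "the princess and the frog!"))
print(sketch_em_match(references, "The movie is loosely based on Iron Henry"))
```

Under this sketch a candidate only matches when the whole normalized string is equal, which is why a long sentence that merely contains a gold answer is not counted as an exact match.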
70
+
71
+ ## F1 Score
72
+ #### `f1_score_with_precision_recall`
73
+
74
+ Calculates F1 score, precision, and recall between a reference and a candidate answer.
75
+
76
+ **Parameters**
77
+
78
+ - `reference_answer` (str): A gold (correct) answer to the question.
79
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
80
+
81
+ **Returns**
82
+
83
+ - `dictionary`: A dictionary containing the F1 score, precision, and recall between a gold and candidate answer.
84
+
85
+ ```python
86
+ from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
87
+
88
+ f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
89
+ print("F1 stats: ", f1_stats)
90
+ '''
91
+ F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
92
+ '''
93
+
94
+ match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
95
+ print("F1 Match: ", match_result)
96
+ '''
97
+ F1 Match: False
98
+ '''
99
+ ```
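
For readers who want to see how a token-overlap F1 and a thresholded match fit together, here is a small self-contained sketch. It uses the conventional definitions (precision over candidate tokens, recall over reference tokens) with simple whitespace tokenization, so it may not reproduce the exact numbers that `f1_score_with_precision_recall` reports. The names `token_f1` and `sketch_f1_match` are hypothetical.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> dict:
    """Token-overlap F1 with lowercase whitespace tokenization (an assumption)."""
    reference_tokens = reference.lower().split()
    candidate_tokens = candidate.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def sketch_f1_match(reference_answers, candidate, threshold=0.5):
    """Return True if any reference reaches the F1 threshold against the candidate."""
    return any(
        token_f1(reference, candidate)["f1"] >= threshold
        for reference in reference_answers
    )


print(token_f1("the frog prince", "the frog"))
print(sketch_f1_match(["the frog prince"], "the frog", threshold=0.5))
```

Raising the threshold makes the match stricter; a threshold of 1.0 requires the two token multisets to be identical.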
100
+
101
+ ## Efficient and Robust Question/Answer Type Evaluation
102
+ #### 1. `get_highest_score`
103
+
104
+ Returns the gold answer and candidate answer pair that has the highest matching score. This function is useful for evaluating the closest match to a given candidate response based on a list of reference answers.
105
+
106
+ **Parameters**
107
+
108
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
109
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
110
+ - `question` (str): The question for which the answers are being evaluated.
111
+
112
+ **Returns**
113
+
114
+ - `dictionary`: A dictionary containing the gold answer and candidate answer that have the highest matching score.
115
+
116
+ #### 2. `get_scores`
117
+
118
+ Returns the matching scores for all gold answer and candidate answer pairs.
119
+
120
+ **Parameters**
121
+
122
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
123
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
124
+ - `question` (str): The question for which the answers are being evaluated.
125
+
126
+ **Returns**
127
+
128
+ - `dictionary`: A dictionary mapping each gold answer to the candidate answer's matching score.
129
+
130
+ #### 3. `evaluate`
131
+
132
+ Returns True if the candidate answer is a match of any of the gold answers.
133
+
134
+ **Parameters**
135
+
136
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
137
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
138
+ - `question` (str): The question for which the answers are being evaluated.
139
+
140
+ **Returns**
141
+
142
+ - `boolean`: A boolean (True/False) indicating whether the candidate answer matches any of the reference answers.
143
+
144
+
145
+ ```python
146
+ from qa_metrics.pedant import PEDANT
147
+
148
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
149
+ pedant = PEDANT()
150
+ scores = pedant.get_scores(reference_answer, candidate_answer, question)
151
+ max_pair, highest_scores = pedant.get_highest_score(reference_answer, candidate_answer, question)
152
+ match_result = pedant.evaluate(reference_answer, candidate_answer, question)
153
+ print("Max Pair: %s; Highest Score: %s" % (max_pair, highest_scores))
154
+ print("Score: %s; PANDA Match: %s" % (scores, match_result))
155
+ '''
156
+ Max Pair: ('the princess and the frog', 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'); Highest Score: 0.854451712151719
157
+ Score: {'the frog prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7131625951317375}, 'the princess and the frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.854451712151719}}; PANDA Match: True
158
+ '''
159
+ ```
160
+
161
+ ```python
162
+ print(pedant.get_score(reference_answer[1], candidate_answer, question))
163
+ '''
164
+ 0.7122460127464126
165
+ '''
166
+ ```
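
The nested dictionary printed by `get_scores` above (gold answer, then candidate answer, then score) can be post-processed without the package. The sketch below picks the best-scoring pair from such a dictionary; the helper name `highest_scoring_pair` is hypothetical, and the scores are copied from the example output above rather than recomputed.

```python
def highest_scoring_pair(scores):
    """Return ((reference, candidate), score) for the largest score in a nested dict."""
    best_pair, best_score = None, float("-inf")
    for reference, by_candidate in scores.items():
        for candidate, score in by_candidate.items():
            if score > best_score:
                best_pair, best_score = (reference, candidate), score
    return best_pair, best_score


candidate = 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'
example_scores = {
    "the frog prince": {candidate: 0.7131625951317375},
    "the princess and the frog": {candidate: 0.854451712151719},
}
pair, score = highest_scoring_pair(example_scores)
print(pair[0], score)
```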
167
+
168
+ ## Transformer Neural Evaluation
169
+ Our fine-tuned BERT model is available on 🤗 [Hugging Face](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). The package also supports downloading the model and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are also supported now! 🔥🔥🔥
170
+
171
+ #### `transformer_match`
172
+
173
+ Returns True if the candidate answer is a match of any of the gold answers.
174
+
175
+ **Parameters**
176
+
177
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
178
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
179
+ - `question` (str): The question for which the answers are being evaluated.
180
+
181
+ **Returns**
182
+
183
+ - `boolean`: A boolean (True/False) indicating whether the candidate answer matches any of the reference answers.
184
+
185
+ ```python
186
+ from qa_metrics.transformerMatcher import TransformerMatcher
187
+
188
+ question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
189
+ # Supported models: roberta-large, roberta, bert, distilbert, distilroberta
190
+ tm = TransformerMatcher("roberta-large")
191
+ scores = tm.get_scores(reference_answer, candidate_answer, question)
192
+ match_result = tm.transformer_match(reference_answer, candidate_answer, question)
193
+ print("Score: %s; TM Match: %s" % (scores, match_result))
194
+ '''
195
+ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.6934309}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7400551}}; TM Match: True
196
+ '''
197
+ ```
198
+
199
+ ## Prompting LLM For Evaluation
200
+
201
+ Note: The prompting function can be used for any prompting purposes.
202
+
203
+ ###### OpenAI
204
+ ```python
205
+ from qa_metrics.prompt_llm import CloseLLM
206
+ model = CloseLLM()
207
+ model.set_openai_api_key(YOUR_OPENAI_KEY)
208
+ prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
209
+ model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
210
+
211
+ '''
212
+ 'correct'
213
+ '''
214
+ ```
215
+
216
+ ###### Anthropic
217
+ ```python
218
+ model = CloseLLM()
219
+ model.set_anthropic_api_key(YOUR_Anthropic_KEY)
220
+ model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
221
+
222
+ '''
223
+ 'correct'
224
+ '''
225
+ ```
226
+
227
+ ###### deepinfra (See below for descriptions of more models)
228
+ ```python
229
+ from qa_metrics.prompt_open_llm import OpenLLM
230
+ model = OpenLLM()
231
+ model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
232
+ model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
233
+
234
+ '''
235
+ 'correct'
236
+ '''
237
+ ```
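
The prompt used in the examples above follows a fixed template, so it can be built and its reply interpreted with two small helpers. This is only a sketch: `build_judge_prompt` and `parse_verdict` are hypothetical names that are not part of qa-metrics, and the parser assumes the model replies with just `correct` or `incorrect`, as the prompt requests.

```python
def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the question/reference/candidate template used in the examples above."""
    return (
        f"question: {question}\n"
        f"reference: {reference}\n"
        f"candidate: {candidate}\n"
        "Is the candidate answer correct based on the question and reference answer? "
        "Please only output correct or incorrect."
    )


def parse_verdict(response: str) -> bool:
    """Map a model reply to a boolean; anything other than 'correct' counts as False."""
    # Strip whitespace, quotes, and a trailing period before comparing.
    cleaned = response.strip().strip("'\".").lower()
    # Exact comparison, because 'incorrect' contains 'correct' as a substring.
    return cleaned == "correct"


prompt = build_judge_prompt(
    "What is the Capital of France?", "Paris", "The capital is Paris"
)
print(prompt)
print(parse_verdict("Correct."))
```

The resulting string can be passed as the `prompt` (or `message`) argument in any of the three examples above.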
238
+
239
+ If you find this repo useful, please cite our paper:
240
+ ```bibtex
241
+ @misc{li2024panda,
242
+ title={PANDA (Pedantic ANswer-correctness Determination and Adjudication): Improving Automatic Evaluation for Question Answering and Text Generation},
243
+ author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
244
+ year={2024},
245
+ eprint={2402.11161},
246
+ archivePrefix={arXiv},
247
+ primaryClass={cs.CL}
248
+ }
249
+ ```
250
+
251
+
252
+ ## Updates
253
+ - [01/24/24] 🔥 The full paper is uploaded and can be accessed [here](https://arxiv.org/abs/2402.11161). The dataset has been expanded and the leaderboard has been updated.
254
+ - Our training dataset is adapted and augmented from [Bulian et al](https://github.com/google-research-datasets/answer-equivalence-dataset). Our [dataset repo](https://github.com/zli12321/Answer_Equivalence_Dataset.git) includes the augmented training set and the QA evaluation test sets discussed in our paper.
255
+ - The package now supports [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta) and [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), smaller and more robust matching models than BERT!
256
+
257
+ ## License
258
+
259
+ This project is licensed under the [MIT License](LICENSE.md) - see the LICENSE file for details.
260
+
261
+ ## Contact
262
+
263
+ For any additional questions or comments, please contact [[email protected]].