Zongxia committed on
Commit b4a505a • 1 Parent(s): 1ec214f

Update README.md

Files changed (1)
  1. README.md +144 -54
README.md CHANGED
@@ -14,7 +14,7 @@ pipeline_tag: text-classification
14
  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
  [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
 
17
- QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models. It provides various efficient and basic metrics to assess the performance of QA models.
18
 
19
  ### Updates
20
  - Updated to version 0.2.8
@@ -33,51 +33,29 @@ To install the package, run the following command:
33
  pip install qa-metrics
34
  ```
35
 
36
- ## Usage
37
 
38
- The python package currently provides six QA evaluation methods.
 
39
 
40
- #### Prompting LLM For Evaluation
 
41
 
42
- Note: The prompting function can be used for any prompting purposes.
43
 
44
- ###### OpenAI
45
- ```python
46
- from qa_metrics.prompt_llm import CloseLLM
47
- model = CloseLLM()
48
- model.set_openai_api_key(YOUR_OPENAI_KEY)
49
- prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
50
- model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
51
 
52
- '''
53
- 'correct'
54
- '''
55
- ```
56
 
57
- ###### Anthropic
58
- ```python
59
- model = CloseLLM()
60
- model.set_anthropic_api_key(YOUR_Anthropic_KEY)
61
- model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
62
 
63
- '''
64
- 'correct'
65
- '''
66
- ```
67
 
68
- ###### deepinfra (See below for descriptions of more models)
69
- ```python
70
- from qa_metrics.prompt_open_llm import OpenLLM
71
- model = OpenLLM()
72
- model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
73
- model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
74
-
75
- '''
76
- 'correct'
77
- '''
78
- ```
79
-
80
- #### Exact Match
81
  ```python
82
  from qa_metrics.em import em_match
83
 
@@ -90,38 +68,110 @@ Exact Match: False
90
  '''
91
  ```
92
 
93
- #### Transformer Match
94
- Our fine-tuned BERT model is this repository. Our Package also supports downloading and matching directly. distilroberta, distilbert, and roberta are also supported now! 🔥🔥🔥
95
 
96
- ```python
97
- from qa_metrics.transformerMatcher import TransformerMatcher
98
 
99
- question = "Which movie is loosley based off the Brother Grimm's Iron Henry?"
100
- tm = TransformerMatcher("distilroberta")
101
- scores = tm.get_scores(reference_answer, candidate_answer, question)
102
- match_result = tm.transformer_match(reference_answer, candidate_answer, question)
103
- print("Score: %s; TM Match: %s" % (scores, match_result))
104
- '''
105
- Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.88954514}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.9381995}}; TM Match: True
106
- '''
107
- ```
108
 
109
- #### F1 Score
110
  ```python
111
  from qa_metrics.f1 import f1_match,f1_score_with_precision_recall
112
 
113
  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
114
  print("F1 stats: ", f1_stats)
 
 
 
115
 
116
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
117
  print("F1 Match: ", match_result)
118
  '''
119
- F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
120
  F1 Match: False
121
  '''
122
  ```
123
 
124
- #### PANDA Match
125
  ```python
126
  from qa_metrics.pedant import PEDANT
127
 
@@ -144,6 +194,46 @@ print(pedant.get_score(reference_answer[1], candidate_answer, question))
144
  0.7122460127464126
145
  '''
146
  ```
147
  ```
148
 
149
  If you find this repo useful, please cite our paper:
 
14
  [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
15
  [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
16
 
17
+ QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides a variety of basic and efficient metrics for assessing the performance of QA models.
18
 
19
  ### Updates
20
  - Updated to version 0.2.8
 
33
  pip install qa-metrics
34
  ```
35
 
36
+ ## Usage/Logistics
37
 
38
+ The Python package currently provides six QA evaluation methods.
39
+ - Given a set of gold answers, a candidate answer to be evaluated, and a question (if applicable), each evaluation method returns True if the candidate answer matches any one of the gold answers, and False otherwise (the example inputs used throughout the sections below are sketched right after this list).
40
+ - The evaluation methods differ in how strictly they judge the correctness of a candidate answer, and some correlate with human judgments more closely than others.
41
+ - Normalized Exact Match and Question/Answer Type Evaluation are the most efficient methods. They are suitable for short-form QA datasets such as NQ-OPEN, HotpotQA, TriviaQA, SQuAD, etc.
42
+ - Question/Answer Type Evaluation and Transformer Neural Evaluation are cost-free and suitable for both short-form and longer-form QA datasets. They correlate with human judgments more closely than exact match and F1 score as the gold and candidate answers grow longer.
43
+ - Black-box LLM evaluations are the closest to human evaluations, but they are not cost-free.
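+ A minimal setup sketch for the example inputs used by the evaluators below; the strings mirror the example outputs shown later in this README and are purely illustrative:
+
+ ```python
+ # Illustrative example inputs (not part of the package itself).
+ question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]  # gold answers
+ candidate_answer = 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'
+ # Every evaluation method below returns True if the candidate matches any gold answer.
+ ```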
44
 
45
+ ## Normalized Exact Match
46
+ #### `em_match`
47
 
48
+ Returns a boolean indicating whether there are any exact normalized matches between gold and candidate answers.
49
 
50
+ **Parameters**
 
51
 
52
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
53
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
 
 
54
 
55
+ **Returns**
 
56
 
57
+ - `boolean`: True if the normalized candidate answer exactly matches any of the normalized reference answers, and False otherwise.
 
58
 
59
  ```python
60
  from qa_metrics.em import em_match
61
 
 
68
  '''
69
  ```
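+ A self-contained sketch of the same call, using the example inputs introduced above; the expected `False` follows from the normalized exact-match rule, since the candidate sentence is not literally one of the gold answers, and the print formatting is illustrative:
+
+ ```python
+ from qa_metrics.em import em_match
+
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
+ candidate_answer = 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'
+
+ # True only if the normalized candidate exactly equals a normalized gold answer.
+ match_result = em_match(reference_answer, candidate_answer)
+ print("Exact Match: ", match_result)
+ '''
+ Exact Match: False
+ '''
+ ```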
70
 
71
+ ## F1 Score
72
+ #### `f1_score_with_precision_recall`
73
 
74
+ Calculates F1 score, precision, and recall between a reference and a candidate answer.
 
75
 
76
+ **Parameters**
77
+
78
+ - `reference_answer` (str): A gold (correct) answer to the question.
79
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
80
+
81
+ **Returns**
82
+
83
+ - `dictionary`: A dictionary containing the F1 score, precision, and recall between a gold and candidate answer.
 
84
 
 
85
  ```python
86
  from qa_metrics.f1 import f1_match,f1_score_with_precision_recall
87
 
88
  f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
89
  print("F1 stats: ", f1_stats)
90
+ '''
91
+ F1 stats: {'f1': 0.25, 'precision': 0.6666666666666666, 'recall': 0.15384615384615385}
92
+ '''
93
 
94
  match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
95
  print("F1 Match: ", match_result)
96
  '''
 
97
  F1 Match: False
98
  '''
99
  ```
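+ As a quick sanity check (illustrative arithmetic, not part of the package), the reported F1 of 0.25 is exactly the harmonic mean of the reported precision (2/3) and recall (2/13):
+
+ ```python
+ # F1 = 2 * P * R / (P + R)
+ precision, recall = 2 / 3, 2 / 13
+ f1 = 2 * precision * recall / (precision + recall)
+ print(round(f1, 2))  # 0.25
+ ```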
100
 
101
+ ## Transformer Neural Evaluation
102
+ Our fine-tuned BERT model is available on 🤗 [Huggingface](https://huggingface.co/Zongxia/answer_equivalence_bert?text=The+goal+of+life+is+%5BMASK%5D.). Our package also supports downloading it and matching directly. [distilroberta](https://huggingface.co/Zongxia/answer_equivalence_distilroberta), [distilbert](https://huggingface.co/Zongxia/answer_equivalence_distilbert), [roberta](https://huggingface.co/Zongxia/answer_equivalence_roberta), and [roberta-large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) are now supported as well! 🔥🔥🔥
103
+
104
+ #### `transformer_match`
105
+
106
+ Returns True if the candidate answer matches any of the gold answers.
107
+
108
+ **Parameters**
109
+
110
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
111
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
112
+ - `question` (str): The question for which the answers are being evaluated.
113
+
114
+ **Returns**
115
+
116
+ - `boolean`: True if the candidate answer matches any of the reference answers, and False otherwise.
117
+
118
+ ```python
119
+ from qa_metrics.transformerMatcher import TransformerMatcher
120
+
121
+ question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
122
+ tm = TransformerMatcher("distilroberta")
123
+ scores = tm.get_scores(reference_answer, candidate_answer, question)
124
+ match_result = tm.transformer_match(reference_answer, candidate_answer, question)
125
+ print("Score: %s; bert Match: %s" % (scores, match_result))
126
+ '''
127
+ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.6934309}, 'The Princess and the Frog': {'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"': 0.7400551}}; TM Match: True
128
+ '''
129
+ ```
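+ The backbone is chosen by the string passed to `TransformerMatcher`. Based on the models listed above, the other checkpoints should be selectable the same way; the identifier strings below are an assumption, not verified API:
+
+ ```python
+ from qa_metrics.transformerMatcher import TransformerMatcher
+
+ # Assumed identifiers mirroring the supported models listed above.
+ tm_fast = TransformerMatcher("distilbert")      # smaller, faster backbone
+ tm_large = TransformerMatcher("roberta-large")  # larger backbone, likely slower but more accurate
+ ```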
130
+
131
+ ## Efficient and Robust Question/Answer Type Evaluation
132
+ #### 1. `get_highest_score`
133
+
134
+ Returns the gold answer and candidate answer pair that has the highest matching score. This function is useful for evaluating the closest match to a given candidate response based on a list of reference answers.
135
+
136
+ **Parameters**
137
+
138
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
139
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
140
+ - `question` (str): The question for which the answers are being evaluated.
141
+
142
+ **Returns**
143
+
144
+ - `dictionary`: A dictionary containing the gold answer and candidate answer that have the highest matching score.
145
+
146
+ #### 2. `get_scores`
147
+
148
+ Returns the matching scores for all gold answer and candidate answer pairs.
149
+
150
+ **Parameters**
151
+
152
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
153
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
154
+ - `question` (str): The question for which the answers are being evaluated.
155
+
156
+ **Returns**
157
+
158
+ - `dictionary`: A dictionary containing the matching score between each gold answer and the candidate answer.
159
+
160
+ #### 3. `evaluate`
161
+
162
+ Returns True if the candidate answer matches any of the gold answers.
163
+
164
+ **Parameters**
165
+
166
+ - `reference_answer` (list of str): A list of gold (correct) answers to the question.
167
+ - `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated.
168
+ - `question` (str): The question for which the answers are being evaluated.
169
+
170
+ **Returns**
171
+
172
+ - `boolean`: True if the candidate answer matches any of the reference answers, and False otherwise.
173
+
174
+
175
  ```python
176
  from qa_metrics.pedant import PEDANT
177
 
 
194
  0.7122460127464126
195
  '''
196
  ```
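+ The three methods documented above can be combined as in the sketch below; it assumes `PEDANT()` is constructed without arguments and reuses the example inputs from earlier in this README:
+
+ ```python
+ from qa_metrics.pedant import PEDANT
+
+ question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
+ candidate_answer = 'The movie "The Princess and the Frog" is loosely based off the Brother Grimm\'s "Iron Henry"'
+
+ pedant = PEDANT()  # assumed no-argument constructor
+ # True/False: does the candidate match any of the gold answers?
+ match_result = pedant.evaluate(reference_answer, candidate_answer, question)
+ # Matching scores for every gold/candidate pair, and the single best-scoring pair.
+ all_scores = pedant.get_scores(reference_answer, candidate_answer, question)
+ best_pair = pedant.get_highest_score(reference_answer, candidate_answer, question)
+ print(match_result)
+ ```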
197
+
198
+
199
+ ## Prompting LLM For Evaluation
200
+
201
+ Note: The prompting functions can be used for any prompting purpose.
202
+
203
+ ###### OpenAI
204
+ ```python
205
+ from qa_metrics.prompt_llm import CloseLLM
206
+ model = CloseLLM()
207
+ model.set_openai_api_key(YOUR_OPENAI_KEY)
208
+ prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
209
+ model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
210
+
211
+ '''
212
+ 'correct'
213
+ '''
214
+ ```
215
+
216
+ ###### Anthropic
217
+ ```python
218
+ model = CloseLLM()
219
+ model.set_anthropic_api_key(YOUR_Anthropic_KEY)
220
+ model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
221
+
222
+ '''
223
+ 'correct'
224
+ '''
225
+ ```
226
+
227
+ ###### deepinfra (See below for descriptions of more models)
228
+ ```python
229
+ from qa_metrics.prompt_open_llm import OpenLLM
230
+ model = OpenLLM()
231
+ model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
232
+ model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
233
+
234
+ '''
235
+ 'correct'
236
+ '''
237
  ```
238
 
239
  If you find this repo useful, please cite our paper: