Update README.md
README.md

---
# QA-Evaluation-Metrics

[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)

QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), an efficient QA evaluation method that remains competitive with transformer-based LLM evaluators.
### Updates
- Updated to version 0.2.8.
- Now supports prompting OpenAI GPT-series and Claude-series models (assuming openai version > 1.0).
- Supports prompting various open-source models, such as LLaMA-2-70B-chat and LLaVA-1.5, via the [deepinfra](https://deepinfra.com/models) API.
## Installation
* Python version >= 3.6
* openai version >= 1.0

To install the package, run the following command:
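```
pip install qa-metrics
```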
## Usage
The Python package currently provides six QA evaluation methods.

#### Prompting LLM For Evaluation

Note: The prompting function can be used for any prompting purposes.
###### OpenAI
```python
from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)

'''
'correct'
'''
```
###### Anthropic
```python
model = CloseLLM()
model.set_anthropic_api_key(YOUR_Anthropic_KEY)
model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)

'''
'correct'
'''
```
###### deepinfra (See below for descriptions of more models)
```python
from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)

'''
'correct'
'''
```
#### Exact Match
```python
from qa_metrics.em import em_match

reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
print("Exact Match: ", match_result)

'''
Exact Match: False
'''
```
#### Transformer Match
Our fine-tuned BERT model is in this repository. The package also supports downloading the model and matching directly; distilroberta, distilbert, and roberta are also supported now! 🔥🔥🔥
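As a rough illustration of the matching workflow (the `TransformerMatcher` class and `transformer_match` method names below are assumptions, not confirmed API; see the package documentation for the exact interface):

```python
# Hypothetical sketch of transformer-based answer matching.
# Class/method names are assumptions -- check the qa-metrics docs for the real interface.
from qa_metrics.transformerMatcher import TransformerMatcher

question = "Which fairy tale is the movie The Princess and the Frog based on?"
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""

# Load one of the supported fine-tuned matchers, e.g. "bert", "distilroberta", "distilbert", or "roberta"
tm = TransformerMatcher("bert")

# Judge whether the candidate answer matches any of the reference answers for this question
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
print("Transformer Match: ", match_result)
```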