Update README.md
README.md
CHANGED
@@ -11,12 +11,21 @@ pipeline_tag: text-classification
---
# QA-Evaluation-Metrics

-[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)

-QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161),


## Installation

To install the package, run the following command:

@@ -26,7 +35,47 @@ pip install qa-metrics

## Usage

-The python package currently provides

#### Exact Match
```python
@@ -57,31 +106,6 @@ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely ba
'''
```

-#### Prompting LLM For Evaluation
-
-Note: The prompting function can be used for any prompting purposes.
-
-###### OpenAI
-```python
-from qa_metrics.prompt_llm import *
-set_openai_api_key(YOUR_OPENAI_KEY)
-prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
-prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_token=10)
-
-'''
-'correct'
-'''
-```
-
-###### Anthropic
-```python
-set_anthropic_api_key(YOUR_OPENAI_KEY)
-prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
-
-'''
-'correct'
-'''
-```

#### F1 Score
```python

---
# QA-Evaluation-Metrics

+[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
+[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)

+QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), an efficient QA evaluation method that retains evaluation performance competitive with transformer LLM evaluators.
+
+### Updates
+- Updated to version 0.2.8.
+- Now supports prompting OpenAI GPT-series and Claude-series models (assuming openai version > 1.0).
+- Supports prompting various open-source models, such as LLaMA-2-70B-chat and LLaVA-1.5, by calling the API from [deepinfra](https://deepinfra.com/models).


## Installation
+* Python version >= 3.6
+* openai version >= 1.0
+

To install the package, run the following command:


## Usage

+The Python package currently provides six QA evaluation methods.
+
+#### Prompting LLM For Evaluation
+
+Note: The prompting functions can be used for any prompting purpose, not only evaluation.
+
+###### OpenAI
+```python
+from qa_metrics.prompt_llm import CloseLLM
+model = CloseLLM()
+model.set_openai_api_key(YOUR_OPENAI_KEY)
+prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
+model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
+
+'''
+'correct'
+'''
+```
+
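As the note above says, the prompting helper is not limited to grading answers. A minimal sketch reusing the `model` configured with the OpenAI key above; the summarization instruction is only an illustrative stand-in for whatever prompt you need:

```python
# Any free-form instruction can go through the same helper; `model` is the
# CloseLLM instance configured above, and the return value is the model's text reply.
reply = model.prompt_gpt(
    prompt='Summarize in one sentence why exact match alone can under-credit paraphrased answers.',
    model_engine='gpt-3.5-turbo',
    temperature=0.1,
    max_tokens=50,
)
print(reply)
```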
+###### Anthropic
+```python
+model = CloseLLM()
+model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
+model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
+
+'''
+'correct'
+'''
+```
+
+###### deepinfra (See below for descriptions of more models)
+```python
+from qa_metrics.prompt_open_llm import OpenLLM
+model = OpenLLM()
+model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
+model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
+
+'''
+'correct'
+'''
+```

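The examples above hard-code a single evaluation prompt. To score a whole dataset, the same template can be filled per example and the judge's reply folded into an accuracy figure. A minimal sketch, independent of the package API; `build_prompt`, `is_judged_correct`, and the `judge` callable are illustrative names, and the parsing rule simply mirrors the "correct or incorrect" instruction used above:

```python
def build_prompt(question: str, reference: str, candidate: str) -> str:
    # Same template as the hard-coded prompt in the examples above.
    return (
        f'question: {question}\n'
        f'reference: {reference}\n'
        f'candidate: {candidate}\n'
        'Is the candidate answer correct based on the question and reference answer? '
        'Please only output correct or incorrect.'
    )

def is_judged_correct(reply: str) -> bool:
    # Count a reply as a positive judgment only if it says "correct" but not "incorrect".
    text = reply.strip().lower()
    return 'incorrect' not in text and 'correct' in text

# `judge` can wrap any of the prompting helpers shown above, e.g.:
#   judge = lambda p: model.prompt_gpt(prompt=p, model_engine='gpt-3.5-turbo',
#                                      temperature=0.1, max_tokens=10)
examples = [
    ('What is the Capital of France?', 'Paris', 'The capital is Paris'),
    ('Who wrote Hamlet?', 'William Shakespeare', 'Christopher Marlowe'),
]
# accuracy = sum(is_judged_correct(judge(build_prompt(*ex))) for ex in examples) / len(examples)
```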
#### Exact Match
```python

'''
```

#### F1 Score
```python