Zongxia committed on
Commit 95515ee
1 Parent(s): bdd2295

Update README.md

Files changed (1)
  1. README.md +52 -28
README.md CHANGED
@@ -11,12 +11,21 @@ pipeline_tag: text-classification
 ---
 # QA-Evaluation-Metrics

- [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)

- QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), a matching method going beyond token-level matching and is more efficient than LLM matchings but still retains competitive evaluation performance of transformer LLM models.

 ## Installation

 To install the package, run the following command:

@@ -26,7 +35,47 @@ pip install qa-metrics
 ## Usage

- The python package currently provides four QA evaluation metrics.

 #### Exact Match
 ```python
@@ -57,31 +106,6 @@ Score: {'The Frog Prince': {'The movie "The Princess and the Frog" is loosely ba
 '''
 ```

- #### Prompting LLM For Evaluation
-
- Note: The prompting function can be used for any prompting purposes.
-
- ###### OpenAI
- ```python
- from qa_metrics.prompt_llm import *
- set_openai_api_key(YOUR_OPENAI_KEY)
- prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
- prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_token=10)
-
- '''
- 'correct'
- '''
- ```
-
- ###### Anthropic
- ```python
- set_anthropic_api_key(YOUR_OPENAI_KEY)
- prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
-
- '''
- 'correct'
- '''
- ```

 #### F1 Score
 ```python
 
 ---
 # QA-Evaluation-Metrics

+ [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
+ [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)

+ QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), an efficient QA evaluation method that retains competitive evaluation performance compared to transformer LLM matchers.
+
+ ### Updates
+ - Updated to version 0.2.8.
+ - Now supports prompting OpenAI GPT-series and Claude-series models (assuming openai version >= 1.0).
+ - Supports prompting various open-source models, such as LLaMA-2-70B-chat and LLaVA-1.5, by calling the [deepinfra](https://deepinfra.com/models) API.


 ## Installation
+ * Python version >= 3.6
+ * openai version >= 1.0
+

  To install the package, run the following command:
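 The install command itself is unchanged context between hunks; it appears verbatim in the hunk header above:
 ```
 pip install qa-metrics
 ```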

 ## Usage

+ The Python package currently provides six QA evaluation methods.
+
+ #### Prompting LLM For Evaluation
+
+ Note: The prompting function can be used for any prompting purposes.
+
+ ###### OpenAI
+ ```python
+ from qa_metrics.prompt_llm import CloseLLM
+ model = CloseLLM()
+ model.set_openai_api_key(YOUR_OPENAI_KEY)
+ prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
+ model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)
+
+ '''
+ 'correct'
+ '''
+ ```
+
+ ###### Anthropic
+ ```python
+ model = CloseLLM()
+ model.set_anthropic_api_key(YOUR_Anthropic_KEY)
+ model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
+
+ '''
+ 'correct'
+ '''
+ ```
+
+ ###### deepinfra (See below for descriptions of more models)
+ ```python
+ from qa_metrics.prompt_open_llm import OpenLLM
+ model = OpenLLM()
+ model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
+ model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
+
+ '''
+ 'correct'
+ '''
+ ```
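
 The prompting helpers above return the model's verdict as a plain string ('correct' or 'incorrect'), so a small wrapper can fold it into a boolean for scoring. The `grade` helper below is an illustrative sketch, not part of the qa-metrics API:

 ```python
 def grade(judgment: str) -> bool:
     # Normalize the returned string and check whether the verdict starts with "correct".
     return judgment.strip().strip("'\"").lower().startswith("correct")

 # Example, assuming `model` and `prompt` are set up as in the snippets above:
 # is_correct = grade(model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo',
 #                                     temperature=0.1, max_tokens=10))
 ```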

 #### Exact Match
 ```python

 '''
 ```


 #### F1 Score
 ```python