Zongxia committed on
Commit 1ccb401
1 Parent(s): 30cc38e

Update README.md

Files changed (1)
  1. README.md +44 -21
README.md CHANGED
@@ -11,12 +11,21 @@ pipeline_tag: text-classification
---
# QA-Evaluation-Metrics

- [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)
+ [![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/)
+ [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17b7vrZqH0Yun2AJaOXydYZxr3cw20Ga6?usp=sharing)

- QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), a matching method going beyond token-level matching and is more efficient than LLM matchings but still retains competitive evaluation performance of transformer LLM models.
+ QA-Evaluation-Metrics is a fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models. It provides various basic metrics to assess the performance of QA models. Check out our paper [**PANDA**](https://arxiv.org/abs/2402.11161), an efficient QA evaluation method that retains evaluation performance competitive with transformer LLM models.
+
+ ### Updates
+ - Updated to version 0.2.8
+ - Supports prompting OpenAI GPT-series models and Claude-series models now (assuming openai version > 1.0)
+ - Supports prompting various open-source models such as LLaMA-2-70B-chat and LLaVA-1.5 by calling the API from [deepinfra](https://deepinfra.com/models)


## Installation
+ * Python version >= 3.6
+ * openai version >= 1.0
+

To install the package, run the following command:

@@ -26,20 +35,7 @@ pip install qa-metrics

## Usage

- The python package currently provides four QA evaluation metrics.
-
- #### Exact Match
- ```python
- from qa_metrics.em import em_match
-
- reference_answer = ["The Frog Prince", "The Princess and the Frog"]
- candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
- match_result = em_match(reference_answer, candidate_answer)
- print("Exact Match: ", match_result)
- '''
- Exact Match: False
- '''
- ```
+ The Python package currently provides six QA evaluation methods.

#### Prompting LLM For Evaluation

@@ -47,10 +43,11 @@ Note: The prompting function can be used for any prompting purposes.

###### OpenAI
```python
- from qa_metrics.prompt_llm import *
- set_openai_api_key(YOUR_OPENAI_KEY)
+ from qa_metrics.prompt_llm import CloseLLM
+ model = CloseLLM()
+ model.set_openai_api_key(YOUR_OPENAI_KEY)
prompt = 'question: What is the Capital of France?\nreference: Paris\ncandidate: The capital is Paris\nIs the candidate answer correct based on the question and reference answer? Please only output correct or incorrect.'
- prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_token=10)
+ model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tokens=10)

'''
'correct'
@@ -59,14 +56,40 @@ prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo', temperature=0.1, max_tok

###### Anthropic
```python
- set_anthropic_api_key(YOUR_OPENAI_KEY)
- prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)
+ model = CloseLLM()
+ model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
+ model.prompt_claude(prompt=prompt, model_engine='claude-v1', anthropic_version="2023-06-01", max_tokens_to_sample=100, temperature=0.7)

'''
'correct'
'''
```

+ ###### deepinfra (see below for descriptions of more models)
+ ```python
+ from qa_metrics.prompt_open_llm import OpenLLM
+ model = OpenLLM()
+ model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
+ model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1', temperature=0.1, max_tokens=10)
+
+ '''
+ 'correct'
+ '''
+ ```
+
+ #### Exact Match
+ ```python
+ from qa_metrics.em import em_match
+
+ reference_answer = ["The Frog Prince", "The Princess and the Frog"]
+ candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
+ match_result = em_match(reference_answer, candidate_answer)
+ print("Exact Match: ", match_result)
+ '''
+ Exact Match: False
+ '''
+ ```
+
#### Transformer Match
Our fine-tuned BERT model is in this repository. Our package also supports downloading and matching directly. distilroberta, distilbert, and roberta are also supported now! 🔥🔥🔥
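
The diff is truncated before the Transformer Match code example. As a rough sketch of what a call to the downloaded matcher might look like, the snippet below assumes the package exposes a `TransformerMatcher` class in `qa_metrics.transformerMatcher` with `get_scores` and `transformer_match` methods; these names come from later revisions of this README and are not confirmed by this commit.

```python
# Hypothetical sketch -- module, class, and method names are assumptions, not part of this commit.
from qa_metrics.transformerMatcher import TransformerMatcher

question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""

# Load one of the supported matchers (e.g. roberta, distilroberta, or distilbert).
tm = TransformerMatcher("roberta")

# Score the candidate against the reference answers and get a boolean match judgment.
scores = tm.get_scores(reference_answer, candidate_answer, question)
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
print("Transformer Match:", match_result)
```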