|
--- |
|
inference: false |
|
license: mit |
|
language: |
|
- en |
|
metrics: |
|
- exact_match |
|
- f1 |
|
- bertscore |
|
pipeline_tag: text-classification |
|
--- |
|
# QA-Evaluation-Metrics
|
|
|
[![PyPI version qa-metrics](https://img.shields.io/pypi/v/qa-metrics.svg)](https://pypi.org/project/qa-metrics/) |
|
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ke23KIeHFdPWad0BModmcWKZ6jSbF5nI?usp=sharing) |
|
|
|
> Check out the main [Repo](https://github.com/zli12321/qa_metrics) |
|
|
|
> A fast and lightweight Python package for evaluating question-answering models, with support for prompting black-box and open-source large language models.
|
|
|
## Latest Updates
|
|
|
- **Version 0.2.19 Released!** |
|
- Paper accepted to EMNLP 2024 Findings!
|
- Enhanced PEDANTS with multi-pipeline support and improved edge case handling |
|
- Added support for OpenAI GPT-series and Claude-series models (requires openai version > 1.0)
|
- Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via [deepinfra](https://deepinfra.com/models) |
|
- Introduced a trained tiny-BERT model for QA evaluation (18 MB model size)
|
- Added direct Huggingface model download support for TransformerMatcher |
|
|
|
## Quick Start
|
|
|
### Prerequisites |
|
- Python >= 3.6 |
|
- openai >= 1.0 |
|
|
|
### Installation |
|
```bash |
|
pip install qa-metrics |
|
``` |
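
To verify the installation, here is a minimal sketch using the `em_match` API documented below:

```python
# Minimal post-install check; em_match is documented under "Normalized Exact Match" below.
from qa_metrics.em import em_match

# "Paris" and "paris" should be equivalent after normalization
print(em_match(["Paris"], "paris"))
```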
|
|
|
## Features
|
|
|
Our package offers six QA evaluation methods with varying strengths: |
|
|
|
| Method | Best For | Cost | Correlation with Human Judgment | |
|
|--------|----------|------|--------------------------------| |
|
| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good | |
|
| PEDANTS | Both short & medium-form QA | Free | Very High | |
|
| [Neural Evaluation](https://huggingface.co/zli12321/answer_equivalence_tiny_bert) | Both short & long-form QA | Free | High | |
|
| [Open Source LLM Evaluation](https://huggingface.co/zli12321/prometheus2-2B) | All QA types | Free | High | |
|
| Black-box LLM Evaluation | All QA types | Paid | Highest | |
|
|
|
## Documentation
|
|
|
### 1. Normalized Exact Match |
|
|
|
#### Method: `em_match` |
|
**Parameters** |
|
- `reference_answer` (list of str): A list of gold (correct) answers to the question |
|
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated |
|
|
|
**Returns** |
|
- `boolean`: True if there are any exact normalized matches between gold and candidate answers |
|
|
|
```python |
|
from qa_metrics.em import em_match |
|
|
|
reference_answer = ["The Frog Prince", "The Princess and the Frog"] |
|
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\"" |
|
match_result = em_match(reference_answer, candidate_answer) |
|
``` |
|
|
|
### 2. F1 Score |
|
|
|
#### Method: `f1_score_with_precision_recall` |
|
**Parameters** |
|
- `reference_answer` (str): A gold (correct) answer to the question |
|
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated |
|
|
|
**Returns** |
|
- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer |
|
|
|
#### Method: `f1_match` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `threshold` (float): F1 score threshold for considering a match (default: 0.5) |
|
|
|
**Returns** |
|
- `boolean`: True if F1 score exceeds threshold for any gold answer |
|
|
|
```python |
|
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall |
|
|
|
# reference_answer and candidate_answer are reused from the em_match example above
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
|
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5) |
|
``` |
|
|
|
### 3. PEDANTS |
|
|
|
#### Method: `get_score` |
|
**Parameters** |
|
- `reference_answer` (str): A gold (correct) answer
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `float`: The similarity score between two strings (0 to 1) |
|
|
|
#### Method: `get_highest_score` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `dictionary`: Contains the gold answer and candidate answer pair with highest matching score |
|
|
|
#### Method: `get_scores` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs |
|
|
|
#### Method: `evaluate` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `boolean`: True if candidate answer matches any gold answer |
|
|
|
#### Method: `get_question_type` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `list`: The type of the question (what, who, when, how, why, which, where) |
|
|
|
#### Method: `get_judgement_type` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `list`: A list of the revised rules applicable to judging answer correctness
|
|
|
```python |
|
from qa_metrics.pedant import PEDANT |
|
|
|
# reference_answer and candidate_answer are reused from the examples above
question = "What movie is loosely based on the Brothers Grimm's 'Iron Henry'?"
pedant = PEDANT()
|
scores = pedant.get_scores(reference_answer, candidate_answer, question) |
|
match_result = pedant.evaluate(reference_answer, candidate_answer, question) |
|
``` |
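
The remaining PEDANTS helpers follow the same calling convention; a minimal sketch, assuming the parameter lists above:

```python
# A minimal sketch of the remaining PEDANTS helpers, following the parameter lists above.
score = pedant.get_score(reference_answer[0], candidate_answer, question)           # similarity in [0, 1]
best_pair = pedant.get_highest_score(reference_answer, candidate_answer, question)  # highest-scoring pair
question_type = pedant.get_question_type(reference_answer, question)                # e.g. what/who/when/...
judgement_rules = pedant.get_judgement_type(reference_answer, candidate_answer, question)
```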
|
|
|
### 4. Transformer Neural Evaluation |
|
|
|
#### Method: `get_score` |
|
**Parameters** |
|
- `reference_answer` (str): A gold (correct) answer
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `float`: The similarity score between two strings (0 to 1) |
|
|
|
#### Method: `get_highest_score` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `dictionary`: Contains the gold answer and candidate answer pair with highest matching score |
|
|
|
#### Method: `get_scores` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs |
|
|
|
#### Method: `transformer_match` |
|
**Parameters** |
|
- `reference_answer` (list of str): List of gold answers |
|
- `candidate_answer` (str): Candidate answer to evaluate |
|
- `question` (str): The question being evaluated |
|
|
|
**Returns** |
|
- `boolean`: True if transformer model considers candidate answer equivalent to any gold answer |
|
|
|
```python |
|
from qa_metrics.transformerMatcher import TransformerMatcher |
|
|
|
# Supported checkpoints include `zli12321/answer_equivalence_tiny_bert`, `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, and `zli12321/answer_equivalence_distilroberta`
|
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert") |
|
# reference_answer, candidate_answer, and question are reused from the examples above
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
|
``` |
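
The scoring helpers mirror the PEDANTS interface; a minimal sketch, assuming the parameter lists above:

```python
# A minimal sketch of the TransformerMatcher scoring helpers, following the parameter lists above.
score = tm.get_score(reference_answer[0], candidate_answer, question)           # similarity in [0, 1]
best_pair = tm.get_highest_score(reference_answer, candidate_answer, question)  # highest-scoring pair
all_scores = tm.get_scores(reference_answer, candidate_answer, question)        # scores for every gold/candidate pair
```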
|
|
|
### 5. LLM Integration |
|
|
|
#### Method: `prompt_gpt` |
|
**Parameters** |
|
- `prompt` (str): The input prompt text |
|
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo') |
|
- `temperature` (float): Controls randomness (0-1) |
|
- `max_tokens` (int): Maximum tokens in response |
|
|
|
```python |
|
from qa_metrics.prompt_llm import CloseLLM |
|
|
|
model = CloseLLM() |
|
model.set_openai_api_key(YOUR_OPENAI_KEY) |
|
# `prompt` is your evaluation prompt; e.g., ask the model to judge a candidate answer against the reference answers
prompt = "Is the candidate answer correct given the question and reference answer? Answer 'correct' or 'incorrect'."
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
|
``` |
|
|
|
#### Method: `prompt_claude` |
|
**Parameters** |
|
- `prompt` (str): The input prompt text |
|
- `model_engine` (str): Claude model to use |
|
- `anthropic_version` (str): API version |
|
- `max_tokens_to_sample` (int): Maximum tokens in response |
|
- `temperature` (float): Controls randomness (0-1) |
|
|
|
```python |
|
model = CloseLLM() |
|
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY) |
|
# reuses the `prompt` defined in the prompt_gpt example above
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
|
``` |
|
|
|
#### Method: `prompt` |
|
**Parameters** |
|
- `message` (str): The input message text |
|
- `model_engine` (str): Model to use |
|
- `temperature` (float): Controls randomness (0-1) |
|
- `max_tokens` (int): Maximum tokens in response |
|
|
|
```python |
|
from qa_metrics.prompt_open_llm import OpenLLM |
|
|
|
model = OpenLLM() |
|
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY) |
|
# reuses the `prompt` defined in the prompt_gpt example above
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
|
``` |
|
|
|
## Model Hub
|
|
|
Our fine-tuned models are available on Huggingface: |
|
- [BERT](https://huggingface.co/Zongxia/answer_equivalence_bert) |
|
- [DistilRoBERTa](https://huggingface.co/Zongxia/answer_equivalence_distilroberta) |
|
- [DistilBERT](https://huggingface.co/Zongxia/answer_equivalence_distilbert) |
|
- [RoBERTa](https://huggingface.co/Zongxia/answer_equivalence_roberta) |
|
- [Tiny-BERT](https://huggingface.co/Zongxia/answer_equivalence_tiny_bert) |
|
- [RoBERTa-Large](https://huggingface.co/Zongxia/answer_equivalence_roberta-large) |
|
|
|
## Resources
|
|
|
- [Full Paper](https://arxiv.org/abs/2402.11161) |
|
- [Dataset Repository](https://github.com/zli12321/Answer_Equivalence_Dataset.git) |
|
- [Supported Models on Deepinfra](https://deepinfra.com/models) |
|
|
|
## Citation
|
|
|
```bibtex |
|
@misc{li2024pedantspreciseevaluationsdiverse, |
|
title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence}, |
|
author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber}, |
|
year={2024}, |
|
eprint={2402.11161}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2402.11161}, |
|
} |
|
``` |
|
|
|
## License
|
|
|
This project is licensed under the [MIT License](LICENSE.md). |
|
|
|
## Contact
|
|
|
For questions or comments, please contact: [email protected] |