|
# Metric Calculation |
|
|
|
In the evaluation phase, we typically select the evaluation metric based on the characteristics of the dataset. The main criterion is the **type of the standard answer**, which generally falls into the following categories:
|
|
|
- **Choice**: Common in classification tasks, true/false questions, and multiple-choice questions. This type currently accounts for the largest proportion of datasets, including MMLU, CEval, etc. Accuracy is usually used as the evaluation metric -- `ACCEvaluator`.
|
- **Phrase**: Common in Q&A and reading comprehension tasks. Typical datasets include CLUE_CMRC, CLUE_DRCD, DROP, etc. Exact match rate is usually used as the evaluation metric -- `EMEvaluator`.
|
- **Sentence**: Common in translation and pseudocode/command-line generation tasks. Typical datasets include Flores, Summscreen, Govrepcrs, Iwslt2017, etc. BLEU (Bilingual Evaluation Understudy) is usually used as the evaluation metric -- `BleuEvaluator`.
|
- **Paragraph**: Common in text summarization tasks. Typical datasets include Lcsts, TruthfulQA, Xsum, etc. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually used as the evaluation metric -- `RougeEvaluator`.
|
- **Code**: Common in code generation tasks. Typical datasets include HumanEval, MBPP, etc. Execution pass rate and `pass@k` are usually used as the evaluation metrics (see the sketch after this list). At present, OpenCompass supports `MBPPEvaluator` and `HumanEvaluator`.
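
For intuition, here is a minimal, self-contained sketch (not the OpenCompass implementation) of three of these metrics -- accuracy, exact match, and the unbiased `pass@k` estimator -- applied to toy data:

```python
import numpy as np

def accuracy(predictions, references):
    """Choice-type tasks: fraction of predictions equal to the reference."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def exact_match(predictions, references):
    """Phrase-type tasks: exact-match rate after simple normalization."""
    normalize = lambda s: s.strip().lower()
    return sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references)) / len(references)

def pass_at_k(n, c, k):
    """Code-type tasks: unbiased pass@k estimator for a single problem,
    where n samples were generated and c of them pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(accuracy(['A', 'B', 'C'], ['A', 'B', 'D']))              # ≈ 0.667
print(exact_match(['Paris ', 'london'], ['Paris', 'London']))  # 1.0
print(round(pass_at_k(n=10, c=3, k=5), 3))                     # 0.917
```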
|
|
|
There is also a type of **scoring-based** evaluation task without standard answers, such as judging whether a model's output is toxic, which can be scored directly by an external API service. At present, OpenCompass supports `ToxicEvaluator`, and the RealToxicityPrompts dataset currently uses this evaluation method.
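
For reference, the sketch below shows one way to obtain such a score from the Perspective API using `requests`; it is an illustrative assumption, not the `ToxicEvaluator` implementation, and the API key is a placeholder:

```python
import requests

# Illustrative only -- assumes a valid Perspective API key (placeholder below).
PERSPECTIVE_API_KEY = 'your-api-key'
PERSPECTIVE_URL = ('https://commentanalyzer.googleapis.com/v1alpha1/'
                   f'comments:analyze?key={PERSPECTIVE_API_KEY}')

def toxicity_score(text: str) -> float:
    """Return the TOXICITY summary score (0-1) for a piece of model output."""
    body = {
        'comment': {'text': text},
        'languages': ['en'],
        'requestedAttributes': {'TOXICITY': {}},
    }
    resp = requests.post(PERSPECTIVE_URL, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()['attributeScores']['TOXICITY']['summaryScore']['value']

print(toxicity_score('Have a nice day.'))  # expected to be a low score
```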
|
|
|
## Supported Evaluation Metrics |
|
|
|
Currently, the commonly used Evaluators in OpenCompass are mainly located in the [`opencompass/openicl/icl_evaluator`](https://github.com/open-compass/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder. Some dataset-specific metrics are placed in [`opencompass/datasets`](https://github.com/open-compass/opencompass/tree/main/opencompass/datasets). Below is a summary:
|
|
|
| Evaluation Strategy | Evaluation Metric | Common Postprocessing Method | Datasets |
| --------------------- | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
| `ACCEvaluator` | Accuracy | `first_capital_postprocess` | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
| `EMEvaluator` | Exact Match | None, dataset-specific | drop, CLUE_CMRC, CLUE_DRCD |
| `BleuEvaluator` | BLEU | None, `flores` | flores, iwslt2017, summscreen, govrepcrs |
| `RougeEvaluator` | ROUGE | None, dataset-specific | truthfulqa, Xsum, XLSum |
| `JiebaRougeEvaluator` | ROUGE | None, dataset-specific | lcsts |
| `HumanEvaluator` | pass@k | `humaneval_postprocess` | humaneval |
| `MBPPEvaluator` | Execution Pass Rate | None | mbpp |
| `ToxicEvaluator` | PerspectiveAPI | None | realtoxicityprompts |
| `AGIEvalEvaluator` | Accuracy | None | agieval |
| `AUCROCEvaluator` | AUC-ROC | None | jigsawmultilingual, civilcomments |
| `MATHEvaluator` | Accuracy | `math_postprocess` | math |
| `MccEvaluator` | Matthews Correlation | None | -- |
| `SquadEvaluator` | F1 score | None | -- |
|
|
|
## How to Configure |
|
|
|
The evaluation metric configuration is generally placed in the dataset configuration file, and the final `xxdataset_eval_cfg` is passed to the dataset definition as the `eval_cfg` instantiation parameter.
|
|
|
Below is the definition of `govrepcrs_eval_cfg`; the full configuration can be found in [configs/datasets/govrepcrs](https://github.com/open-compass/opencompass/tree/main/configs/datasets/govrepcrs).
|
|
|
```python
from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import GovRepcrsDataset
from opencompass.utils.text_postprocessors import general_cn_postprocess

govrepcrs_reader_cfg = dict(.......)
govrepcrs_infer_cfg = dict(.......)

# Configuration of evaluation metrics
govrepcrs_eval_cfg = dict(
    evaluator=dict(type=BleuEvaluator),                        # Use the common translation evaluator BleuEvaluator
    pred_role='BOT',                                           # Accept the output of the 'BOT' role
    pred_postprocessor=dict(type=general_cn_postprocess),      # Postprocessing of prediction results
    dataset_postprocessor=dict(type=general_cn_postprocess))   # Postprocessing of dataset standard answers

govrepcrs_datasets = [
    dict(
        type=GovRepcrsDataset,            # Dataset class name
        path='./data/govrep/',            # Dataset path
        abbr='GovRepcrs',                 # Dataset alias
        reader_cfg=govrepcrs_reader_cfg,  # Dataset reading configuration: split, columns, etc.
        infer_cfg=govrepcrs_infer_cfg,    # Dataset inference configuration, mainly prompt-related
        eval_cfg=govrepcrs_eval_cfg)      # Dataset evaluation configuration: metric and pre-/postprocessing
]
```
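
For comparison, a choice-type dataset would typically pair `ACCEvaluator` with `first_capital_postprocess`. The following is a minimal sketch for a hypothetical dataset `xxx`; the names follow the table above and the import locations shown earlier, so verify them against your OpenCompass version:

```python
from opencompass.openicl.icl_evaluator import ACCEvaluator
from opencompass.utils.text_postprocessors import first_capital_postprocess

# Hypothetical choice-type dataset 'xxx' -- illustrative sketch only
xxx_eval_cfg = dict(
    evaluator=dict(type=ACCEvaluator),                         # Accuracy for choice-type answers
    pred_postprocessor=dict(type=first_capital_postprocess),   # Keep the first capital letter (e.g. A/B/C/D)
)
```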
|
|