
Metric Calculation

In the evaluation phase, the metric is usually chosen according to the characteristics of the dataset. The main criterion is the type of the standard (reference) answer, which generally falls into one of the following categories:

  • Choice: Common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, for example MMLU and CEval. Accuracy is usually used as the evaluation standard (ACCEvaluator).
  • Phrase: Common in Q&A and reading comprehension tasks. Datasets of this type mainly include CLUE_CMRC, CLUE_DRCD, and DROP. Match rate (exact match) is usually used as the evaluation standard (EMEvaluator).
  • Sentence: Common in translation and in pseudocode or command-line generation tasks. Datasets of this type mainly include Flores, Summscreen, Govrepcrs, and Iwslt2017. BLEU (Bilingual Evaluation Understudy) is usually used as the evaluation standard (BleuEvaluator).
  • Paragraph: Common in text summarization tasks. Commonly used datasets include Lcsts, TruthfulQA, and Xsum. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually used as the evaluation standard (RougeEvaluator).
  • Code: Common in code generation tasks. Commonly used datasets include Humaneval and MBPP. Execution pass rate and pass@k are usually used as the evaluation standard. OpenCompass currently supports MBPPEvaluator and HumanEvaluator.
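
To make the simpler metrics in this list concrete, here is a minimal, dependency-free sketch of accuracy, exact match, and pass@k. It is not taken from the OpenCompass source: the function names, the whitespace and case normalization, and the choice to report percentages are assumptions made for illustration. The pass@k function uses the commonly cited unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests.

```python
from math import comb
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions identical to their reference (Choice-type answers)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(predictions)


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately light normalization)."""
    return " ".join(text.lower().split())


def exact_match(predictions: Sequence[str],
                references: Sequence[Sequence[str]]) -> float:
    """Percentage of predictions equal to any acceptable answer (Phrase-type answers)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        _normalize(pred) in {_normalize(ans) for ans in answers}
        for pred, answers in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c passing, budget k."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a whole benchmark, pass@k would be averaged over problems. BLEU and ROUGE are left out here because they are normally computed with dedicated libraries rather than by hand.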

There are also scoring-type evaluation tasks that have no standard answer, such as judging whether a model's output is toxic. These can be scored directly by calling a related API service. OpenCompass currently supports ToxicEvaluator for this purpose, and the realtoxicityprompts dataset is evaluated this way.

Supported Evaluation Metrics

In OpenCompass, the commonly used Evaluators are mainly located in the opencompass/openicl/icl_evaluator folder. Some dataset-specific metrics are instead placed alongside the dataset code in opencompass/datasets. Below is a summary:

| Evaluation Strategy | Evaluation Metrics | Common Postprocessing Method | Datasets |
| --- | --- | --- | --- |
| ACCEvaluator | Accuracy | first_capital_postprocess | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
| EMEvaluator | Match Rate | None, dataset-specific | drop, CLUE_CMRC, CLUE_DRCD |
| BleuEvaluator | BLEU | None, flores | flores, iwslt2017, summscreen, govrepcrs |
| RougeEvaluator | ROUGE | None, dataset-specific | truthfulqa, Xsum, XLSum |
| JiebaRougeEvaluator | ROUGE | None, dataset-specific | lcsts |
| HumanEvaluator | pass@k | humaneval_postprocess | humaneval |
| MBPPEvaluator | Execution Pass Rate | None | mbpp |
| ToxicEvaluator | PerspectiveAPI | None | realtoxicityprompts |
| AGIEvalEvaluator | Accuracy | None | agieval |
| AUCROCEvaluator | AUC-ROC | None | jigsawmultilingual, civilcomments |
| MATHEvaluator | Accuracy | math_postprocess | math |
| MccEvaluator | Matthews Correlation | None | -- |
| SquadEvaluator | F1-scores | None | -- |
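
The "Common Postprocessing Method" column matters because raw model output rarely matches the reference format exactly. As a rough illustration, the sketch below assumes that a postprocessor in the spirit of first_capital_postprocess keeps only the first uppercase letter of the output. The helper first_capital_sketch is a hypothetical stand-in written for this example, not the OpenCompass implementation.

```python
def first_capital_sketch(text: str) -> str:
    """Return the first uppercase character in text, or '' if there is none.

    Hypothetical stand-in for a multiple-choice postprocessor; the real
    OpenCompass function may behave differently.
    """
    for ch in text:
        if ch.isupper():
            return ch
    return ""


# Raw generations for four multiple-choice questions, with their gold labels.
raw_outputs = ["B. Because the premise says so", "C", "(A) is correct", "not sure"]
gold = ["B", "C", "A", "D"]

# Reduce each generation to a single option letter, then compute accuracy.
cleaned = [first_capital_sketch(out) for out in raw_outputs]
accuracy = 100.0 * sum(p == g for p, g in zip(cleaned, gold)) / len(gold)
```

A rule this simple is easy to fool: under the same assumption, "The answer is B" would be reduced to "T", which is one reason the postprocessor has to be chosen to fit the prompt and the dataset.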

How to Configure

The evaluation configuration is generally placed in the dataset configuration file, and the final xxdataset_eval_cfg is passed to the dataset definition as its eval_cfg instantiation parameter.

The definition of govrepcrs_eval_cfg is shown below. For the full configuration, refer to configs/datasets/govrepcrs.

from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import GovRepcrsDataset
from opencompass.utils.text_postprocessors import general_cn_postprocess

govrepcrs_reader_cfg = dict(.......)
govrepcrs_infer_cfg = dict(.......)

# Configuration of evaluation metrics
govrepcrs_eval_cfg = dict(
    evaluator=dict(type=BleuEvaluator),            # Use the general-purpose translation evaluator BleuEvaluator
    pred_role='BOT',                               # Accept 'BOT' role output
    pred_postprocessor=dict(type=general_cn_postprocess),      # Postprocessing of prediction results
    dataset_postprocessor=dict(type=general_cn_postprocess))   # Postprocessing of dataset standard answers

govrepcrs_datasets = [
    dict(
        type=GovRepcrsDataset,                 # Dataset class name
        path='./data/govrep/',                 # Dataset path
        abbr='GovRepcrs',                      # Dataset alias
        reader_cfg=govrepcrs_reader_cfg,       # Dataset reading configuration: split, columns, etc.
        infer_cfg=govrepcrs_infer_cfg,         # Dataset inference configuration, mainly related to the prompt
        eval_cfg=govrepcrs_eval_cfg)           # Dataset evaluation configuration: evaluation standard and postprocessing
]
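
To show how the three parts of an eval_cfg fit together, the sketch below runs a toy configuration of the same shape. It assumes that postprocessors are applied to predictions and references before the evaluator scores them, and that an evaluator exposes a score(predictions, references) method returning a dict. The names run_eval, ExactMatchSketch, and squeeze_spaces are hypothetical and do not come from OpenCompass, whose actual evaluation task handles considerably more (roles, registries, result files).

```python
from typing import Dict, List


class ExactMatchSketch:
    """Hypothetical evaluator exposing score(predictions, references)."""

    def score(self, predictions: List[str], references: List[str]) -> Dict[str, float]:
        if len(predictions) != len(references):
            raise ValueError("predictions and references must have the same length")
        if not predictions:
            return {"exact_match": 0.0}
        hits = sum(p == r for p, r in zip(predictions, references))
        return {"exact_match": 100.0 * hits / len(predictions)}


def squeeze_spaces(text: str) -> str:
    """Toy postprocessor: trim the text and collapse runs of whitespace."""
    return " ".join(text.split())


def run_eval(eval_cfg: dict, predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Apply the optional postprocessors, then let the evaluator score the result."""
    pred_post = eval_cfg.get("pred_postprocessor")
    data_post = eval_cfg.get("dataset_postprocessor")
    if pred_post is not None:
        predictions = [pred_post["type"](p) for p in predictions]
    if data_post is not None:
        references = [data_post["type"](r) for r in references]
    evaluator = eval_cfg["evaluator"]["type"]()  # instantiate the evaluator class
    return evaluator.score(predictions, references)


# Same three keys as govrepcrs_eval_cfg, with toy components swapped in.
toy_eval_cfg = dict(
    evaluator=dict(type=ExactMatchSketch),
    pred_postprocessor=dict(type=squeeze_spaces),
    dataset_postprocessor=dict(type=squeeze_spaces),
)

result = run_eval(toy_eval_cfg, ["  hello   world ", "foo"], ["hello world", "bar"])
```

Applying the same postprocessor to both sides, as govrepcrs_eval_cfg does with general_cn_postprocess, keeps predictions and references in a comparable form before the metric is computed.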