# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The gold standard for such evaluation is human preference, but annotation comes at a high cost. To explore a model's subjective capabilities, we employ a JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

Two popular evaluation methods are:

- Compare Mode: compare model responses pairwise and calculate their win rate, as in [Chatbot Arena](https://chat.lmsys.org/).
- Score Mode: assign a score to each single model response.

We support using GPT-4 (or any other JudgeLLM) for subjective evaluation of models based on the methods above.

## Subjective Evaluation with Custom Dataset

The process consists of:

1. Data preparation
2. Model response generation
3. Evaluation of the responses with a JudgeLLM
4. Collection of the JudgeLLM's judgments and calculation of the metric

### Step-1: Data Preparation

We provide mini test sets for **Compare Mode** and **Score Mode**, shown below:

```python
### COREV2
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },
    ...
]

### CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家,我让你给谁发送什么主题的邮件,你就帮我扩充好邮件正文,并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题,来斟酌用词,并使用合适的敬语。现在请给导师发送邮件,询问他是否可以下周三下午15:00进行科研同步会,大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
    ...
]
```

The JSON must include the following fields:

- 'question': The question description.
- 'capability': The capability dimension of the question.
- 'others': Any other information that is needed. If you want to customize the prompt for each individual question, you can put the extra information into 'others' and use it when constructing the prompt.
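If you build your own test set, the snippet below is a minimal sketch (not part of OpenCompass) of writing a file in this format. The file name and the example questions are placeholders; 'others' may be an empty string or a dict of extra per-question information, as in the COREV2 example above.

```python
import json

# Minimal sketch: build a tiny custom subjective test set with the three
# required fields. File name and question contents are placeholders.
samples = [
    {
        "question": "Write a short, polite email asking your advisor to reschedule a meeting.",
        "capability": "email writing",
        "others": "",
    },
    {
        "question": "If I throw a ball straight up, which direction does it travel first?",
        "capability": "commonsense knowledge",
        # 'others' can also carry per-question extras (e.g. a reference answer
        # or evaluating guidance) if your prompt template makes use of them.
        "others": {
            "question": "If I throw a ball straight up, which direction does it travel first?",
            "evaluating_guidance": "",
            "reference_answer": "Up",
        },
    },
]

with open("my_subjective_dataset.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```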
### Step-2: Evaluation Configuration (Compare Mode)

For `config/eval_subjective_compare.py`, we provide annotations to help users understand the configuration file:

```python
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI
from opencompass.partitioners import NaivePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import Corev2Summarizer

with read_base():
    # Pre-defined models
    from .models.qwen.hf_qwen_7b_chat import models as hf_qwen_7b_chat
    from .models.chatglm.hf_chatglm3_6b import models as hf_chatglm3_6b
    from .models.qwen.hf_qwen_14b_chat import models as hf_qwen_14b_chat
    from .models.openai.gpt_4 import models as gpt4_model
    from .datasets.subjective_cmp.subjective_corev2 import subjective_datasets

# Evaluation datasets
datasets = [*subjective_datasets]

# Models to be evaluated
models = [*hf_qwen_7b_chat, *hf_chatglm3_6b]

# Inference configuration
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',  # m models vs. n models
        # Under the m2n setting, base_models and compare_models must be specified;
        # the program generates pairs between base_models and compare_models.
        base_models=[*hf_qwen_14b_chat],  # baseline model
        compare_models=[*hf_qwen_7b_chat, *hf_chatglm3_6b]  # models to be evaluated
    ),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=gpt4_model  # judge model
        )),
)

work_dir = './outputs/subjective/'

summarizer = dict(
    type=Corev2Summarizer,   # custom summarizer
    match_method='smart',    # answer extraction
)
```

In addition, you can change the response order of the two models; refer to `config/eval_subjective_compare.py`. When `infer_order` is set to `random`, the two responses are presented to the judge in random order; when `infer_order` is set to `double`, each pair is judged twice, once in each response order.

### Step-2: Evaluation Configuration (Score Mode)

`config/eval_subjective_score.py` is largely the same as `config/eval_subjective_compare.py`; the only change needed is setting the eval mode to `singlescore`, as sketched below.
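The following is a minimal sketch of the part that is assumed to change relative to the Compare Mode config above (the partitioner's `mode`, with `base_models`/`compare_models` no longer required); check the `config/eval_subjective_score.py` shipped with your OpenCompass version for the authoritative fields.

```python
# Sketch only: imports and the rest of the config (models, infer, work_dir,
# summarizer) are assumed to stay as in the Compare Mode config above.
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',  # score each model response on its own, no pairing
    ),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=gpt4_model  # judge model, as in Compare Mode
        )),
)
```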
### Step-3: Launch the Evaluation

```shell
python run.py config/eval_subjective_score.py -r
```

The `-r` parameter allows reusing existing model inference and GPT-4 evaluation results.

The JudgeLLM's responses will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`. The evaluation report will be output to `output/.../summary/timestamp/report.csv`.

OpenCompass supports many JudgeLLMs; in fact, you can use any model as a JudgeLLM in OpenCompass configs. Popular open-source JudgeLLMs are listed here:

1. Auto-J, refer to `configs/models/judge_llm/auto_j`

   Consider citing the following papers if you find them helpful:

   ```bibtex
   @article{li2023generative,
     title={Generative judge for evaluating alignment},
     author={Li, Junlong and Sun, Shichao and Yuan, Weizhe and Fan, Run-Ze and Zhao, Hai and Liu, Pengfei},
     journal={arXiv preprint arXiv:2310.05470},
     year={2023}
   }

   @misc{2023opencompass,
     title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
     author={OpenCompass Contributors},
     howpublished = {\url{https://github.com/open-compass/opencompass}},
     year={2023}
   }
   ```

2. JudgeLM, refer to `configs/models/judge_llm/judgelm`

   Consider citing the following papers if you find them helpful:

   ```bibtex
   @article{zhu2023judgelm,
     title={JudgeLM: Fine-tuned Large Language Models are Scalable Judges},
     author={Zhu, Lianghui and Wang, Xinggang and Wang, Xinlong},
     journal={arXiv preprint arXiv:2310.17631},
     year={2023}
   }

   @misc{2023opencompass,
     title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
     author={OpenCompass Contributors},
     howpublished = {\url{https://github.com/open-compass/opencompass}},
     year={2023}
   }
   ```

3. PandaLM, refer to `configs/models/judge_llm/pandalm`

   Consider citing the following papers if you find them helpful:

   ```bibtex
   @article{wang2023pandalm,
     title={PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization},
     author={Wang, Yidong and Yu, Zhuohao and Zeng, Zhengran and Yang, Linyi and Wang, Cunxiang and Chen, Hao and Jiang, Chaoya and Xie, Rui and Wang, Jindong and Xie, Xing and others},
     journal={arXiv preprint arXiv:2306.05087},
     year={2023}
   }

   @misc{2023opencompass,
     title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
     author={OpenCompass Contributors},
     howpublished = {\url{https://github.com/open-compass/opencompass}},
     year={2023}
   }
   ```

## Practice: AlignBench Evaluation

### Dataset

```bash
mkdir -p ./data/subjective/
cd ./data/subjective
git clone https://github.com/THUDM/AlignBench.git
# data format conversion
python ../../../tools/convert_alignmentbench.py --mode json --jsonl data/data_release.jsonl
```

### Configuration

Please edit the config `configs/eval_subjective_alignbench.py` according to your needs.

### Evaluation

```bash
HF_EVALUATE_OFFLINE=1 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python run.py configs/eval_subjective_alignbench.py
```

### Submit to the Official Leaderboard (Optional)

If you need to submit your predictions to the official leaderboard, you can use `tools/convert_alignmentbench.py` for format conversion.

- Make sure you have the following results:

  ```bash
  outputs/
  └── 20231214_173632
      ├── configs
      ├── logs
      ├── predictions # model responses
      ├── results
      └── summary
  ```

- Convert the data:

  ```bash
  python tools/convert_alignmentbench.py --mode csv --exp-folder outputs/20231214_173632
  ```

- Get the `.csv` files in `submission/` for submission:

  ```bash
  outputs/
  └── 20231214_173632
      ├── configs
      ├── logs
      ├── predictions
      ├── results
      ├── submission # files for submission
      └── summary
  ```
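Before uploading, it can be worth eyeballing the converted files. Below is a small sketch; the output folder name matches the example above, and since the converter's exact file names and columns are not specified here, the code simply globs for CSV files and prints their first rows.

```python
import csv
import glob

# Hypothetical sanity check before uploading: glob for the converted CSV files
# and print the row count, header, and first data row of each.
for path in glob.glob("outputs/20231214_173632/submission/*.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    print(f"{path}: {len(rows)} rows")
    for row in rows[:2]:  # header + first data row
        print(row)
```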