# Quick Start
![image](https://github.com/open-compass/opencompass/assets/22607038/d063cae0-3297-4fd2-921a-366e0a24890b)
## Overview
OpenCompass provides a streamlined workflow for evaluating a model, which consists of the following stages: **Configure** -> **Inference** -> **Evaluation** -> **Visualization**.
**Configure**: This is your starting point. Here, you'll set up the entire evaluation process, choosing the model(s) and dataset(s) to assess. You also have the option to select an evaluation strategy, the computation backend, and define how you'd like the results displayed.
**Inference & Evaluation**: OpenCompass efficiently manages the heavy lifting, conducting parallel inference and evaluation on your chosen model(s) and dataset(s). The **Inference** phase is all about producing outputs from your datasets, whereas the **Evaluation** phase measures how well these outputs align with the gold standard answers. While this procedure is broken down into multiple "tasks" that run concurrently for greater efficiency, be aware that limited computational resources may introduce unexpected overheads and generally slow down evaluation. To understand this issue and how to address it, check out [FAQ: Efficiency](faq.md#efficiency).
**Visualization**: Once the evaluation is done, OpenCompass collates the results into an easy-to-read table and saves them as both CSV and TXT files. If you need real-time updates, you can activate Lark reporting to receive immediate status reports in your Lark clients.
Coming up, we'll walk you through the basics of OpenCompass, showcasing evaluations of pretrained models [OPT-125M](https://huggingface.co/facebook/opt-125m) and [OPT-350M](https://huggingface.co/facebook/opt-350m) on the [SIQA](https://huggingface.co/datasets/social_i_qa) and [Winograd](https://huggingface.co/datasets/winograd_wsc) benchmark tasks. Their configuration files can be found at [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py).
Before running this experiment, please make sure you have installed OpenCompass locally; the experiment should run successfully on a single _GTX-1660-6G_ GPU.
For models with more parameters, such as Llama-7B, refer to the other examples provided in the [configs directory](https://github.com/open-compass/opencompass/tree/main/configs).
## Configuring an Evaluation Task
In OpenCompass, each evaluation task consists of the model to be evaluated and the dataset. The entry point for evaluation is `run.py`. Users can select the model and dataset to be tested either via command line or configuration files.
`````{tabs}
````{tab} Command Line
Users can combine the models and datasets they want to test using `--models` and `--datasets`.
```bash
python run.py --models hf_opt_125m hf_opt_350m --datasets siqa_gen winograd_ppl
```
The models and datasets are pre-stored in the form of configuration files in `configs/models` and `configs/datasets`. Users can view or filter the currently available model and dataset configurations using `tools/list_configs.py`.
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
:::{dropdown} More about `list_configs`
:animate: fade-in-slide-down
Running `python tools/list_configs.py llama mmlu` produces output like:
```text
+-----------------+-----------------------------------+
| Model | Config Path |
|-----------------+-----------------------------------|
| hf_llama2_13b | configs/models/hf_llama2_13b.py |
| hf_llama2_70b | configs/models/hf_llama2_70b.py |
| ... | ... |
+-----------------+-----------------------------------+
+-------------------+---------------------------------------------------+
| Dataset | Config Path |
|-------------------+---------------------------------------------------|
| cmmlu_gen | configs/datasets/cmmlu/cmmlu_gen.py |
| cmmlu_gen_ffe7c0 | configs/datasets/cmmlu/cmmlu_gen_ffe7c0.py |
| ... | ... |
+-------------------+---------------------------------------------------+
```
Users can use the names in the first column as input parameters for `--models` and `--datasets` in `python run.py`. For datasets, the same name with different suffixes generally indicates that the prompts or evaluation methods differ.
:::
:::{dropdown} Model not on the list?
:animate: fade-in-slide-down
If you want to evaluate other models, please check out the "Command Line (Custom HF Model)" tab for the way to construct a custom HF model without a configuration file, or "Configuration File" tab to learn the general way to prepare your model configurations.
:::
````
````{tab} Command Line (Custom HF Model)
For HuggingFace models, users can set model parameters directly through the command line without additional configuration files. For instance, you can evaluate the `facebook/opt-125m` model with the following command:
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \
--max-seq-len 2048 \
--max-out-len 100 \
--batch-size 128 \
--num-gpus 1  # Minimum required number of GPUs
```
Note that with this approach, OpenCompass evaluates only one model at a time, while the other approaches can evaluate multiple models at once.
```{caution}
`--num-gpus` does not stand for the actual number of GPUs to use in evaluation, but the minimum required number of GPUs for this model. [More](faq.md#how-does-opencompass-allocate-gpus)
```
:::{dropdown} More detailed example
:animate: fade-in-slide-down
```bash
python run.py --datasets siqa_gen winograd_ppl \
--hf-path facebook/opt-125m \ # HuggingFace model path
--tokenizer-path facebook/opt-125m \ # HuggingFace tokenizer path (if the same as the model path, can be omitted)
--tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True \ # Arguments to construct the tokenizer
--model-kwargs device_map='auto' \ # Arguments to construct the model
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--max-out-len 100 \ # Maximum number of tokens to generate
--batch-size 64 \ # Batch size
--num-gpus 1 # Number of GPUs required to run the model
```
```{seealso}
For all HuggingFace related parameters supported by `run.py`, please read [Launching Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task).
```
:::
````
````{tab} Configuration File
In addition to configuring the experiment through the command line, OpenCompass also allows users to write the full configuration of the experiment in a configuration file and run it directly through `run.py`. The configuration file is organized in Python format and must include the `datasets` and `models` fields.
The configuration used in this test is [configs/eval_demo.py](https://github.com/open-compass/opencompass/blob/main/configs/eval_demo.py). It imports the required dataset and model configurations through the [inheritance mechanism](../user_guides/config.md#inheritance-mechanism) and combines them into the `datasets` and `models` fields in the required format.
```python
from mmengine.config import read_base

with read_base():
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .datasets.winograd.winograd_ppl import winograd_datasets
    from .models.opt.hf_opt_125m import opt125m
    from .models.opt.hf_opt_350m import opt350m

datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]
```
When running tasks, we just need to pass the path of the configuration file to `run.py`:
```bash
python run.py configs/eval_demo.py
```
:::{dropdown} More about `models`
:animate: fade-in-slide-down
OpenCompass provides a series of pre-defined model configurations under `configs/models`. Below is the configuration snippet related to [opt-350m](https://github.com/open-compass/opencompass/blob/main/configs/models/opt/hf_opt_350m.py) (`configs/models/opt/hf_opt_350m.py`):
```python
# Evaluate models supported by HuggingFace's `AutoModelForCausalLM` using `HuggingFaceCausalLM`
from opencompass.models import HuggingFaceCausalLM

# OPT-350M
opt350m = dict(
    type=HuggingFaceCausalLM,
    # Initialization parameters for `HuggingFaceCausalLM`
    path='facebook/opt-350m',
    tokenizer_path='facebook/opt-350m',
    tokenizer_kwargs=dict(
        padding_side='left',
        truncation_side='left',
        proxies=None,
        trust_remote_code=True),
    model_kwargs=dict(device_map='auto'),
    # Below are common parameters for all models, not specific to HuggingFaceCausalLM
    abbr='opt350m',             # Model abbreviation for result display
    max_seq_len=2048,           # The maximum length of the entire sequence
    max_out_len=100,            # Maximum number of generated tokens
    batch_size=64,              # Batch size
    run_cfg=dict(num_gpus=1),   # The number of GPUs required for this model
)
```
When using these configurations, we can specify the relevant files through the command-line argument `--models`, or import the model configurations into the `models` list of a configuration file using the inheritance mechanism.
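Because these entries are plain Python dictionaries, a pre-defined configuration can also be tweaked after importing it. Below is a minimal sketch, assuming it lives in a file under `configs/`; the `opt350m_bs16` abbreviation and the batch size of 16 are made up for illustration:
```python
from copy import deepcopy

from mmengine.config import read_base

with read_base():
    from .models.opt.hf_opt_350m import opt350m

# Derive a variant without modifying the pre-defined config.
opt350m_bs16 = deepcopy(opt350m)
opt350m_bs16['abbr'] = 'opt350m_bs16'  # hypothetical abbreviation for result display
opt350m_bs16['batch_size'] = 16        # e.g. to fit a smaller GPU

models = [opt350m_bs16]
```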
```{seealso}
More information about model configuration can be found in [Prepare Models](../user_guides/models.md).
```
:::
:::{dropdown} More about `datasets`
:animate: fade-in-slide-down
Similar to models, dataset configuration files are provided under `configs/datasets`. Users can use `--datasets` in the command line or import the related configurations in a configuration file via inheritance.
Below is a dataset-related configuration snippet from `configs/eval_demo.py`:
```python
from mmengine.config import read_base  # Use mmengine.read_base() to read the base configuration

with read_base():
    # Directly read the required dataset configurations from the preset dataset configurations
    from .datasets.winograd.winograd_ppl import winograd_datasets  # Read Winograd configuration, evaluated based on PPL (perplexity)
    from .datasets.siqa.siqa_gen import siqa_datasets  # Read SIQA configuration, evaluated based on generation

datasets = [*siqa_datasets, *winograd_datasets]  # The final config needs to contain the required evaluation dataset list 'datasets'
```
Dataset configurations typically come in two types, 'ppl' and 'gen', indicating the evaluation method used: `ppl` refers to discriminative evaluation, while `gen` refers to generative evaluation.
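For intuition, the main difference between the two lies in the inferencer specified in each dataset's `infer_cfg`. The snippet below is a simplified sketch rather than the exact contents of the shipped configs (prompt templates, retrievers, and evaluators are omitted):
```python
from opencompass.openicl.icl_inferencer import GenInferencer, PPLInferencer

# 'gen' datasets let the model generate an answer freely and compare it with the reference,
# while 'ppl' datasets score each candidate answer by perplexity and pick the most likely one.
gen_style_infer_cfg = dict(inferencer=dict(type=GenInferencer))
ppl_style_infer_cfg = dict(inferencer=dict(type=PPLInferencer))
```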
Moreover, [configs/datasets/collections](https://github.com/open-compass/opencompass/blob/main/configs/datasets/collections) houses various dataset collections, making it convenient for comprehensive evaluations. OpenCompass often uses [`base_medium.py`](/configs/datasets/collections/base_medium.py) for full-scale model testing. To replicate results, simply import that file, for example:
```bash
python run.py --models hf_llama_7b --datasets base_medium
```
```{seealso}
You can find more information from [Dataset Preparation](../user_guides/datasets.md).
```
:::
````
`````
```{warning}
OpenCompass usually assumes network is available. If you encounter network issues or wish to run OpenCompass in an offline environment, please refer to [FAQ - Network - Q1](./faq.md#network) for solutions.
```
The following sections use the configuration-based method as an example to explain the other features.
## Launching Evaluation
Since OpenCompass launches evaluation processes in parallel by default, we can start the evaluation in `--debug` mode for the first run and check if there is any problem. In `--debug` mode, the tasks will be executed sequentially and output will be printed in real time.
```bash
python run.py configs/eval_demo.py -w outputs/demo --debug
```
The pretrained models 'facebook/opt-350m' and 'facebook/opt-125m' will be automatically downloaded from HuggingFace during the first run.
If everything is fine, you should see "Starting inference process" on screen:
```bash
[2023-07-12 18:23:55,076] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
```
Then you can press `ctrl+c` to interrupt the program, and run the following command in normal mode:
```bash
python run.py configs/eval_demo.py -w outputs/demo
```
In normal mode, the evaluation tasks will be executed in parallel in the background, and their output will be redirected to the output directory `outputs/demo/{TIMESTAMP}`. The progress bar on the frontend only indicates the number of completed tasks, regardless of their success or failure. **Any backend task failures will only trigger a warning message in the terminal.**
:::{dropdown} More parameters in `run.py`
:animate: fade-in-slide-down
Here are some parameters related to evaluation that can help you configure more efficient inference tasks based on your environment:
- `-w outputs/demo`: Work directory to save evaluation logs and results. In this case, the experiment result will be saved to `outputs/demo/{TIMESTAMP}`.
- `-r`: Reuse existing inference results, and skip the finished tasks. If followed by a timestamp, the result under that timestamp in the workspace path will be reused; otherwise, the latest result in the specified workspace path will be reused.
- `--mode all`: Specify which stage of the task to run.
- all: (Default) Perform a complete evaluation, including inference and evaluation.
- infer: Perform inference on each dataset.
- eval: Perform evaluation based on the inference results.
- viz: Display evaluation results only.
- `--max-partition-size 2000`: Dataset partition size. Some datasets may be large, and using this parameter can split them into multiple sub-tasks to efficiently utilize resources. However, if the partition is too fine, the overall speed may be slower due to longer model loading times.
- `--max-num-workers 32`: Maximum number of parallel tasks. In distributed environments such as Slurm, this parameter specifies the maximum number of submitted tasks. In a local environment, it specifies the maximum number of tasks executed in parallel. Note that the actual number of parallel tasks depends on the available GPU resources and may not be equal to this number.
If you are not performing the evaluation on your local machine but using a Slurm cluster, you can specify the following parameters:
- `--slurm`: Submit tasks using Slurm on the cluster.
- `--partition(-p) my_part`: Slurm cluster partition.
- `--retry 2`: Number of retries for failed tasks.
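The `--max-partition-size` and `--max-num-workers` options above also have configuration-file counterparts through the `infer` stage's partitioner and runner. The snippet below is an assumption-heavy sketch: the parameter names are inferred from the CLI flags and may not match the current schema exactly, so check the guide referenced below before relying on them.
```python
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

# Rough config-file counterpart of `--max-partition-size` and `--max-num-workers`
# (parameter names are assumptions, not verified against the current schema).
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=2000),
    runner=dict(type=LocalRunner,
                max_num_workers=32,
                task=dict(type=OpenICLInferTask)),
)
```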
```{seealso}
The entry also supports submitting tasks to Alibaba Deep Learning Center (DLC), and more customized evaluation strategies. Please refer to [Launching an Evaluation Task](../user_guides/experimentation.md#launching-an-evaluation-task) for details.
```
:::
## Visualizing Evaluation Results
After the evaluation is complete, the evaluation results table will be printed as follows:
```text
dataset version metric mode opt350m opt125m
--------- --------- -------- ------ --------- ---------
siqa e78df3 accuracy gen 21.55 12.44
winograd b6c7ed accuracy ppl 51.23 49.82
```
All run outputs are directed to the `outputs/demo/` directory with the following structure:
```text
outputs/demo/
├── 20200220_120000
├── 20230220_183030     # one folder per experiment
│   ├── configs         # Dumped config files for record; multiple configs may be kept if different experiments are re-run in the same experiment folder
│   ├── logs            # Log files for both the inference and evaluation stages
│   │   ├── eval
│   │   └── infer
│   ├── predictions     # Prediction results for each task
│   ├── results         # Evaluation results for each task
│   └── summary         # Summarized evaluation results for a single experiment
├── ...
```
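Since the summarized scores are also written to the `summary` folder as CSV/TXT files, they can be post-processed programmatically. Below is a minimal sketch, assuming at least one finished run under `outputs/demo/` (the exact CSV filename varies per run):
```python
import csv
import glob

# Pick the most recent run and read whichever summary CSV it produced.
summary_files = sorted(glob.glob('outputs/demo/*/summary/*.csv'))
with open(summary_files[-1], newline='') as f:
    for row in csv.DictReader(f):
        print(row)
```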
The summarization process can be further customized in the configuration, for example to output the averaged scores of groups of benchmarks (MMLU, C-Eval, etc.).
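As a rough illustration (the exact keys and available summary groups are described in the document linked below), such a customization is typically expressed through a `summarizer` field in the evaluation config; the dataset abbreviations below are just the ones from this demo:
```python
# Illustrative sketch of a summarizer config; see the Results Summary guide for the full schema.
summarizer = dict(
    # Only datasets with these abbreviations appear in the final table.
    dataset_abbrs=['siqa', 'winograd'],
    # Groups of datasets whose scores are averaged (e.g. MMLU sub-tasks) would be listed here.
    summary_groups=[],
)
```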
More information about obtaining evaluation results can be found in [Results Summary](../user_guides/summarizer.md).
## Additional Tutorials
To learn more about using OpenCompass, explore the following tutorials:
- [Prepare Datasets](../user_guides/datasets.md)
- [Prepare Models](../user_guides/models.md)
- [Task Execution and Monitoring](../user_guides/experimentation.md)
- [Understand Prompts](../prompt/overview.md)
- [Results Summary](../user_guides/summarizer.md)
- [Learn about Config](../user_guides/config.md)