TwT-6's picture
Upload 2667 files
256a159 verified
# Prompt Attack
We support prompt attack following the idea of [PromptBench](https://github.com/microsoft/promptbench). The main purpose here is to evaluate the robustness of prompt instruction, which means when attack/modify the prompt to instruct the task, how well can this task perform as the original task.
## Set up environment
Some components are necessary to prompt attack experiment, therefore we need to set up environments.
```shell
git clone https://github.com/microsoft/promptbench.git
pip install textattack==0.3.8
export PYTHONPATH=$PYTHONPATH:promptbench/
```
## How to attack
### Add a dataset config
We will use GLUE-wnli dataset as example, most configuration settings can refer to [config.md](../user_guides/config.md) for help.
First we need support the basic dataset config, you can find the existing config files in `configs` or support your own config according to [new-dataset](./new_dataset.md)
Take the following `infer_cfg` as example, we need to define the prompt template. `adv_prompt` is the basic prompt placeholder to be attacked in the experiment. `sentence1` and `sentence2` are the input columns of this dataset. The attack will only modify the `adv_prompt` here.
Then, we should use `AttackInferencer` with `original_prompt_list` and `adv_key` to tell the inferencer where to attack and what text to be attacked.
More details can refer to `configs/datasets/promptbench/promptbench_wnli_gen_50662f.py` config file.
```python
original_prompt_list = [
'Are the following two sentences entailment or not_entailment? Answer me with "A. entailment" or "B. not_entailment", just one word. ',
"Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'.",
...,
]
wnli_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role="HUMAN",
prompt="""{adv_prompt}
Sentence 1: {sentence1}
Sentence 2: {sentence2}
Answer:"""),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(
type=AttackInferencer,
original_prompt_list=original_prompt_list,
adv_key='adv_prompt'))
```
### Add a eval config
We should use `OpenICLAttackTask` here for attack task. Also `NaivePartitioner` should be used because the attack experiment will run the whole dataset repeatedly for nearly hurdurds times to search the best attack, we do not want to split the dataset for convenience.
```note
Please choose a small dataset(example < 1000) for attack, due to the aforementioned repeated search, otherwise the time cost is enumerous.
```
There are several other options in `attack` config:
- `attack`: attack type, available options includes `textfooler`, `textbugger`, `deepwordbug`, `bertattack`, `checklist`, `stresstest`;
- `query_budget`: upper boundary of queries, which means the total numbers of running the dataset;
- `prompt_topk`: number of topk prompt to be attacked. In most case, the original prompt list is great than 10, running the whole set is time consuming.
```python
# Please run whole dataset at a time, aka use `NaivePartitioner` only
# Please use `OpenICLAttackTask` if want to perform attack experiment
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=SlurmRunner,
max_num_workers=8,
task=dict(type=OpenICLAttackTask),
retry=0),
)
attack = dict(
attack='textfooler',
query_budget=100,
prompt_topk=2,
)
```
### Run the experiment
Please use `--mode infer` when run the attack experiment, and set `PYTHONPATH` env.
```shell
python run.py configs/eval_attack.py --mode infer
```
All the results will be saved in `attack` folder.
The content includes the original prompt accuracy and the attacked prompt with dropped accuracy of `topk` prompt, for instance:
```
Prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'., acc: 59.15%
Prompt: Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'., acc: 57.75%
Prompt: Analyze the two provided sentences and decide if their relationship is 'A. entailment' or 'B. not_entailment'., acc: 56.34%
Prompt: Identify whether the given pair of sentences demonstrates entailment or not_entailment. Answer with 'A. entailment' or 'B. not_entailment'., acc: 54.93%
...
Original prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'.
Attacked prompt: b"Assess the attach between the following sentences and sorted it as 'A. entailment' or 'B. not_entailment'."
Original acc: 59.15%, attacked acc: 40.85%, dropped acc: 18.31%
```