Prompt Attack
We support prompt attacks following the idea of PromptBench. The main purpose is to evaluate the robustness of prompt instructions: when the prompt that instructs the task is attacked or modified, how well does the task perform compared to the original prompt?
Set up environment
Some extra components are required for the prompt attack experiment, so we need to set up the environment first:
```bash
git clone https://github.com/microsoft/promptbench.git
pip install textattack==0.3.8
export PYTHONPATH=$PYTHONPATH:promptbench/
```
How to attack
Add a dataset config
We will use the GLUE-wnli dataset as an example; for most configuration settings, refer to config.md for help.

First, we need to add the basic dataset config. You can find existing config files in configs, or add your own config according to new-dataset.
Take the following `infer_cfg` as an example. We need to define the prompt template, where `adv_prompt` is the placeholder for the prompt to be attacked in the experiment, and `sentence1` and `sentence2` are the input columns of this dataset. The attack will only modify the `adv_prompt` field.

Then, we should use `AttackInferencer` with `original_prompt_list` and `adv_key` to tell the inferencer where to attack and which text to attack. For more details, refer to the configs/datasets/promptbench/promptbench_wnli_gen_50662f.py config file.
```python
original_prompt_list = [
    'Are the following two sentences entailment or not_entailment? Answer me with "A. entailment" or "B. not_entailment", just one word. ',
    "Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'.",
    ...,
]

wnli_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(
                role="HUMAN",
                prompt="""{adv_prompt}
Sentence 1: {sentence1}
Sentence 2: {sentence2}
Answer:"""),
        ]),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(
        type=AttackInferencer,
        original_prompt_list=original_prompt_list,
        adv_key='adv_prompt'))
```
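For intuition, the text the model actually sees for one example can be sketched with plain string formatting. This is only an illustration of the rendered output (the example sentences below are made up); the real rendering is handled by OpenCompass's PromptTemplate.

```python
# Simplified sketch of filling the template placeholders for one WNLI example.
# The sentences are hypothetical; real inputs come from the dataset's
# sentence1/sentence2 columns, and adv_prompt is the text the attack modifies.
template = """{adv_prompt}
Sentence 1: {sentence1}
Sentence 2: {sentence2}
Answer:"""

rendered = template.format(
    adv_prompt=('Are the following two sentences entailment or not_entailment? '
                'Answer me with "A. entailment" or "B. not_entailment", just one word.'),
    sentence1="The cat sat on the mat.",
    sentence2="A cat is sitting on a mat.",
)
print(rendered)
```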
Add an eval config
We should use `OpenICLAttackTask` for the attack task. `NaivePartitioner` should also be used, because the attack experiment runs the whole dataset repeatedly (up to hundreds of times) to search for the best attack, and we do not want to split the dataset. Please choose a small dataset (fewer than 1000 examples) for the attack; otherwise, due to the aforementioned repeated search, the time cost would be enormous.

There are several other options in the `attack` config:
- `attack`: attack type; available options include `textfooler`, `textbugger`, `deepwordbug`, `bertattack`, `checklist` and `stresstest`;
- `query_budget`: upper bound on the number of queries, i.e., the total number of times the dataset will be run;
- `prompt_topk`: number of top-k prompts to be attacked. In most cases, the original prompt list contains more than 10 prompts, and running the whole set is time-consuming.
```python
# Please run the whole dataset at a time, aka use `NaivePartitioner` only
# Please use `OpenICLAttackTask` if you want to perform an attack experiment
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=SlurmRunner,
        max_num_workers=8,
        task=dict(type=OpenICLAttackTask),
        retry=0),
)

attack = dict(
    attack='textfooler',
    query_budget=100,
    prompt_topk=2,
)
```
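To see why a small dataset matters, consider a rough back-of-the-envelope cost estimate. The helper below is not part of the toolkit, and the per-example latency is an illustrative assumption; it only shows that the worst-case cost scales with dataset size times query budget.

```python
# Rough worst-case cost estimate for an attack run (illustrative numbers).
# Each attack query re-runs inference over the whole dataset, so the cost
# scales as num_examples * query_budget.
def attack_cost_estimate(num_examples: int, query_budget: int,
                         seconds_per_example: float = 0.5) -> float:
    """Return the worst-case wall-clock cost in seconds."""
    return num_examples * query_budget * seconds_per_example

# A 1000-example dataset with query_budget=100 at an assumed 0.5 s/example:
hours = attack_cost_estimate(1000, 100) / 3600
print(f"worst case: {hours:.1f} hours")
```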
Run the experiment
Please use `--mode infer` when running the attack experiment, and set the `PYTHONPATH` environment variable.
```bash
python run.py configs/eval_attack.py --mode infer
```
All results will be saved in the `attack` folder. The output includes the accuracy of each original prompt, plus the attacked versions of the top-k prompts together with their dropped accuracy, for instance:
```text
Prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'., acc: 59.15%
Prompt: Does the relationship between the given sentences represent entailment or not_entailment? Respond with 'A. entailment' or 'B. not_entailment'., acc: 57.75%
Prompt: Analyze the two provided sentences and decide if their relationship is 'A. entailment' or 'B. not_entailment'., acc: 56.34%
Prompt: Identify whether the given pair of sentences demonstrates entailment or not_entailment. Answer with 'A. entailment' or 'B. not_entailment'., acc: 54.93%
...
Original prompt: Assess the connection between the following sentences and classify it as 'A. entailment' or 'B. not_entailment'.
Attacked prompt: b"Assess the attach between the following sentences and sorted it as 'A. entailment' or 'B. not_entailment'."
Original acc: 59.15%, attacked acc: 40.85%, dropped acc: 18.31%
```
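The dropped accuracy is simply the original accuracy minus the attacked accuracy, in percentage points. A small helper (hypothetical, not part of the toolkit) makes the relation explicit; note that recomputing from the two-decimal values printed above gives 18.30%, while the report's 18.31% is presumably rounded from the unrounded accuracies.

```python
def dropped_acc(original_acc: float, attacked_acc: float) -> float:
    """Accuracy drop caused by the attack, in percentage points."""
    return original_acc - attacked_acc

# Recomputed from the rounded values in the example report:
print(f"{dropped_acc(59.15, 40.85):.2f}%")  # prints 18.30%
```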