Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Paper: arXiv:2404.14461
Datasets and models used for the trojan detection competition co-located at SaTML 2024: https://github.com/ethz-spylab/rlhf_trojan_competition
Note: Reward model used to train poisoned_generation_trojan1 using PPO
Note: Reward model used to train poisoned_generation_trojan2 using PPO
Note: Reward model used to train poisoned_generation_trojan3 using PPO
Note: Reward model used to train poisoned_generation_trojan4 using PPO
Note: Reward model used to train poisoned_generation_trojan5 using PPO
Note: Private test set used to evaluate the submissions to the competition
Note: Dataset given to participants to find the trojans in the generation models