RLHF Trojan Competition - an ethz-spylab Collection


Updated Apr 30

Datasets and models used for the trojan detection competition co-located with SaTML 2024: https://github.com/ethz-spylab/rlhf_trojan_competition


Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Paper • 2404.14461 • Published Apr 22 • 2
Universal Jailbreak Backdoors from Poisoned Human Feedback

Paper • 2311.14455 • Published Nov 24, 2023 • 1
ethz-spylab/poisoned_generation_trojan1

Text Generation • Updated Apr 29 • 2.06k • 3
ethz-spylab/poisoned_generation_trojan2

Text Generation • Updated Apr 29 • 63 • 1
ethz-spylab/poisoned_generation_trojan3

Text Generation • Updated Apr 29 • 33 • 1
ethz-spylab/poisoned_generation_trojan4

Text Generation • Updated Apr 29 • 25 • 1
ethz-spylab/poisoned_generation_trojan5

Text Generation • Updated Apr 29 • 14 • 1
ethz-spylab/reward_model

Updated Apr 29 • 880 • 5
ethz-spylab/competition_trojan1

Viewer • Updated Mar 20 • 42.5k • 44
ethz-spylab/competition_trojan2

Viewer • Updated Mar 20 • 42.5k • 35
ethz-spylab/competition_trojan3

Viewer • Updated Mar 20 • 42.5k • 37
ethz-spylab/competition_trojan4

Viewer • Updated Mar 20 • 42.5k • 38
ethz-spylab/competition_trojan5

Viewer • Updated Mar 20 • 42.5k • 38
ethz-spylab/competition_reward_trojan1

Updated Mar 20 • 103

Note: Reward model used to train poisoned_generation_trojan1 with PPO.
ethz-spylab/competition_reward_trojan2

Updated Mar 20 • 11

Note: Reward model used to train poisoned_generation_trojan2 with PPO.
ethz-spylab/competition_reward_trojan3

Updated Mar 20 • 3

Note: Reward model used to train poisoned_generation_trojan3 with PPO.
ethz-spylab/competition_reward_trojan4

Updated Mar 20

Note: Reward model used to train poisoned_generation_trojan4 with PPO.
ethz-spylab/competition_reward_trojan5

Updated Mar 20 • 1

Note: Reward model used to train poisoned_generation_trojan5 with PPO.
ethz-spylab/competition_eval_dataset

Viewer • Updated Mar 20 • 2.31k • 267 • 1

Note: Private test set used to evaluate the submissions to the competition.
ethz-spylab/rlhf_trojan_dataset

Viewer • Updated Nov 20, 2023 • 42.5k • 265 • 6

Note: Dataset given to participants to find the trojans in the generation models.
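
The repository names above follow a regular pattern across the five trojans, so the related artifacts can be grouped programmatically. The following is a minimal sketch that only builds repository ID strings; it does not download anything. The pairing of each reward model with its generation model comes from the notes above. The pairing of `competition_trojanN` with the same index is an assumption based on the naming alone, and `TrojanArtifacts` and `artifacts_for` are hypothetical helper names introduced for illustration.

```python
from dataclasses import dataclass

ORG = "ethz-spylab"
NUM_TROJANS = 5  # the collection lists trojan1 through trojan5


@dataclass(frozen=True)
class TrojanArtifacts:
    """Repository IDs associated with one trojan index (hypothetical helper)."""
    index: int
    generation_model: str
    reward_model: str
    dataset: str  # assumed pairing, inferred from the shared index in the name


def artifacts_for(index: int) -> TrojanArtifacts:
    """Build the repository IDs for a given trojan index (1 to 5)."""
    if not 1 <= index <= NUM_TROJANS:
        raise ValueError(f"trojan index must be in 1..{NUM_TROJANS}, got {index}")
    return TrojanArtifacts(
        index=index,
        generation_model=f"{ORG}/poisoned_generation_trojan{index}",
        # Per the notes, this reward model was used to train the model above with PPO.
        reward_model=f"{ORG}/competition_reward_trojan{index}",
        # Assumption: the dataset with the same index belongs to the same trojan.
        dataset=f"{ORG}/competition_trojan{index}",
    )


ALL_TROJANS = [artifacts_for(i) for i in range(1, NUM_TROJANS + 1)]

if __name__ == "__main__":
    for item in ALL_TROJANS:
        print(item.generation_model, item.reward_model, item.dataset)
```

The resulting strings could then be passed to whichever loading tool is in use; access conditions for the individual repositories are not stated on this page.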
