Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition Paper • 2406.07954 • Published Jun 12
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs Paper • 2404.14461 • Published Apr 22
RLHF Trojan Competition Collection • Datasets and models used for the trojan detection competition co-located at SaTML 2024: https://github.com/ethz-spylab/rlhf_trojan_competition • 20 items • Updated Apr 30
Universal Jailbreak Backdoors from Poisoned Human Feedback Paper • 2311.14455 • Published Nov 24, 2023
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation Paper • 2311.03348 • Published Nov 6, 2023
Personas as a Way to Model Truthfulness in Language Models Paper • 2310.18168 • Published Oct 27, 2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback Paper • 2307.15217 • Published Jul 27, 2023