AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models • arXiv:2310.15140 • Published Oct 23, 2023
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models • arXiv:2409.00598 • Published Sep 1, 2024