Javier Rando

javirandor

https://javirando.com

AI & ML interests

NLP, Safety

javirandor's activity

upvoted a paper 5 months ago

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Paper • 2406.07954 • Published Jun 12 • 2

upvoted a paper 6 months ago

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Paper • 2404.14461 • Published Apr 22 • 2

upvoted a collection 6 months ago

RLHF Trojan Competition

Datasets and models used for the trojan detection competition co-located at SaTML 2024: https://github.com/ethz-spylab/rlhf_trojan_competition • 20 items • Updated Apr 30 • 4

upvoted a paper 9 months ago

Universal Jailbreak Backdoors from Poisoned Human Feedback

Paper • 2311.14455 • Published Nov 24, 2023 • 1

upvoted a paper 12 months ago

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Paper • 2311.03348 • Published Nov 6, 2023 • 1

upvoted a paper about 1 year ago

Personas as a Way to Model Truthfulness in Language Models

Paper • 2310.18168 • Published Oct 27, 2023 • 5

upvoted 2 papers over 1 year ago

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Paper • 2307.15217 • Published Jul 27, 2023 • 36

Red-Teaming the Stable Diffusion Safety Filter

Paper • 2210.04610 • Published Oct 3, 2022 • 2