Mert Sengil
AI & ML interests
LLMs
Recent Activity
Reacted to gabrielchua's post with 👀 1 day ago
Sharing my first paper!
Large Language Models (LLMs) are powerful, but they're prone to off-topic misuse, where users push them beyond their intended scope. Think harmful prompts, jailbreaks, and other out-of-scope requests. So how do we build better guardrails?
Traditional guardrails rely on curated examples or classifiers. The problem?
⚠️ High false-positive rates
⚠️ Poor adaptability to new misuse types
⚠️ Reliance on real-world data, which is often unavailable during pre-production
Our method skips the need for real-world misuse examples. Instead, we:
1️⃣ Define the problem space qualitatively
2️⃣ Use an LLM to generate synthetic misuse prompts
3️⃣ Train and test guardrails on this dataset
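The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's pipeline: `fake_llm` is a hypothetical stand-in for a real LLM call, the system prompts are made up, and the field names (`system_prompt`, `prompt`, `off_topic`) are invented here rather than taken from the released dataset.

```python
import random

# Step 1: describe the problem space qualitatively (hypothetical examples).
SYSTEM_PROMPTS = [
    "You are a customer-support assistant for a bank.",
    "You are a tutor that answers high-school maths questions.",
]


def fake_llm(system_prompt: str, kind: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call.

    A real generator would condition on system_prompt; this stub only
    picks a canned prompt from the requested class.
    """
    canned = {
        "relevant": ["How do I reset my card PIN?", "What is the derivative of x^2?"],
        "unrelated": ["Write me a poem about pirates.", "Who won the 1998 World Cup?"],
    }
    return rng.choice(canned[kind])


def build_dataset(system_prompts, llm, n_per_class=2, seed=0):
    """Step 2: generate labelled synthetic prompts for each system prompt."""
    rng = random.Random(seed)
    rows = []
    for sp in system_prompts:
        for label, kind in ((0, "relevant"), (1, "unrelated")):
            for _ in range(n_per_class):
                rows.append(
                    {"system_prompt": sp, "prompt": llm(sp, kind, rng), "off_topic": label}
                )
    return rows


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 3 (data side): shuffle and split so a guardrail can be trained and tested."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


dataset = build_dataset(SYSTEM_PROMPTS, fake_llm)
train_rows, test_rows = train_test_split(dataset)
```

Passing the generator in as a callable keeps the sketch runnable offline; swapping `fake_llm` for a real model call would not change the surrounding logic.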
We apply this to the off-topic prompt detection problem, and fine-tune simple bi- and cross-encoder classifiers that outperform heuristics based on cosine similarity or prompt engineering.
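For context, a cosine-similarity heuristic of the kind mentioned as a baseline might look like the sketch below. It is an assumption-laden toy, not the authors' implementation: `embed` is a hypothetical bag-of-words stand-in for a real sentence-embedding model, and the `threshold` value is arbitrary.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real baseline would use a
    sentence-embedding model instead of word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_off_topic(system_prompt: str, user_prompt: str, threshold: float = 0.1) -> bool:
    """Flag the user prompt when its similarity to the system prompt is below
    the (arbitrary, illustrative) threshold."""
    return cosine_similarity(embed(system_prompt), embed(user_prompt)) < threshold


system_prompt = "You answer questions about bank accounts, cards, and loans."
print(is_off_topic(system_prompt, "How do I block my bank cards?"))
print(is_off_topic(system_prompt, "Tell me a joke involving pirates."))
```

A single fixed threshold over surface similarity is easy to fool in both directions, which is one plausible reason a fine-tuned classifier could do better.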
Additionally, framing the problem as prompt relevance allows these fine-tuned classifiers to generalise to other risk categories (e.g., jailbreak, toxic prompts).
Through this work, we also open-source our dataset (2M examples, ~50M+ tokens) and models.
paper: https://huggingface.co/papers/2411.12946
artifacts: https://huggingface.co/collections/govtech/off-topic-guardrail-673838a62e4c661f248e81a4
Liked a Space 3 days ago
openfree/trending-board
Organizations
None yet
Sengil's activity
How to get best result?
#31 opened about 1 month ago
by
Sengil
Faster MusicGen Generation with Streaming
2
#23 opened about 1 year ago
by
sanchit-gandhi
Question
1
#28 opened 8 months ago
by
Chelik
how to get best result
#22 opened about 1 month ago
by
Sengil
GPU and memory requirements
2
#89 opened 2 months ago
by
Sengil
How can I make this model run faster?
3
#78 opened 3 months ago
by
Sengil
How to plot results using supervision
2
#5 opened 5 months ago
by
vishwamgupta