Hyperion 🏹

Hyperion is an extremely lightweight (435M parameters) RoBERTa-based binary classifier that detects jailbreak and prompt injection attempts with about 88% accuracy on our test cases.

We are continuously releasing open-source models created during our research on prompt injection and model alignment. These models are not state of the art; they are a very limited preview of our current capabilities. Hyperion was one of our earliest tests while building infrastructure for real-time jailbreak detection. Smaller models and lightweight infrastructure are cheaper and faster, which lets us respond rapidly in the emerging cat-and-mouse game of adversarial prompting. To learn more about us, visit our website!

Intended Use

Binary classification to detect prompt injection and related language model jailbreak techniques.

Training Data

Data Source: Preliminary proof of concept dataset of publicly available red and blue team data
Data Size: 100k rows
Data Composition: 50% false, 50% true (data beyond this balanced set was discarded for this model)
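
The 50/50 composition above can be illustrated with a small sketch. This card does not say how the surplus rows were chosen, so the random downsampling of the larger class shown here is an assumption, and the `balance_binary` helper, the `label` field, and the toy rows are all hypothetical.

```python
import random
from collections import Counter


def balance_binary(rows, label_key="label", seed=0):
    """Downsample the larger class so both labels appear equally often.

    Assumption: surplus rows are dropped uniformly at random. The actual
    selection procedure used for Hyperion is not documented in this card.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)  # avoid all positives preceding all negatives
    return balanced


# Toy data (hypothetical): 7 rows labelled True and 3 labelled False.
rows = [{"text": f"prompt {i}", "label": i < 7} for i in range(10)]
balanced = balance_binary(rows)
counts = Counter(r["label"] for r in balanced)
print(counts[True], counts[False])  # prints: 3 3
```

A balanced set keeps accuracy easy to interpret, since a classifier that always predicts one class would score only 50%.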

Validation Metrics

Loss: 0.347
Accuracy: 0.876
Precision: 0.876
Recall: 0.875
AUC: 0.951
F1: 0.876
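
As a quick sanity check on the figures above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two values. The sketch below does that; the 0.001 tolerance is an assumption chosen to allow for the reported numbers being rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the Validation Metrics list.
reported = {"precision": 0.876, "recall": 0.875, "f1": 0.876}

derived = f1_score(reported["precision"], reported["recall"])
# Tolerance is an assumption: the inputs are themselves rounded.
consistent = abs(derived - reported["f1"]) < 0.001
print(f"{derived:.4f}", consistent)  # prints: 0.8755 True
```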

Considerations

This model has only been evaluated on a limited proof of concept dataset and has not been thoroughly tested.
This model is observed to be overly aggressive in screening.
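
One common way to soften an overly aggressive binary classifier is to require a higher positive-class probability before flagging an input. The sketch below is illustrative only: it assumes the classifier exposes a probability for the "injection" class, and the `is_flagged` helper, the example scores, and the 0.9 threshold are hypothetical values, not tuned or recommended settings for Hyperion.

```python
def is_flagged(injection_probability, threshold=0.5):
    """Flag an input only if its injection probability meets the threshold."""
    return injection_probability >= threshold


# Hypothetical positive-class probabilities for three prompts.
scores = [0.55, 0.72, 0.97]

default_flags = [is_flagged(s) for s in scores]
strict_flags = [is_flagged(s, threshold=0.9) for s in scores]

print(default_flags)  # prints: [True, True, True]
print(strict_flags)   # prints: [False, False, True]
```

Raising the threshold trades recall for precision, so any chosen value should be checked against labelled data before use.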

Caveats and Recommendations

This is an early-stage research model and has not been validated for real-world use (use at your own risk!).
Further testing on larger, more diverse datasets is recommended before considering production deployment.
Monitor for potential biased performance across different demographic groups.