license: apache-2.0
language:
- en
tags:
- Cybersecurity
- Cyber Security
- Information Security
- Computer Science
- Cyber Threats
- Vulnerabilities
- Vulnerability
- Malware
- Attacks
Model Card for CySecBERT
CySecBERT is a domain-adapted version of the BERT model tailored for cybersecurity tasks. It is based on a cybersecurity dataset consisting of 4.3 million entries from Twitter, blogs, scientific papers, and CVEs related to the cybersecurity domain.
Model Details
- Developed by: Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter
- Model type: BERT-base
- Language(s) (NLP): English
- Finetuned from model: bert-base-uncased.
Model Sources
- Repository: https://github.com/PEASEC/CySecBERT
- Paper: https://dl.acm.org/doi/abs/10.1145/3652594 and https://arxiv.org/abs/2212.02974
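To get started, the model can be loaded like any standard BERT checkpoint via the transformers library. The sketch below is a minimal example; the Hub identifier markusbayer/CySecBERT is an assumption and should be replaced with the actual model ID of this repository if it differs.

```python
# Minimal usage sketch: load CySecBERT as a standard BERT encoder.
# The model identifier below is an assumption; replace it with the
# actual Hugging Face Hub ID of this model card if it differs.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "markusbayer/CySecBERT"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "A new vulnerability in the SSH daemon allows remote code execution."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states to obtain a sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) for a BERT-base encoder
```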
Bias, Risks, Limitations, and Recommendations
We would like to emphasise that we did not explicitly focus on or analyse social biases in the data or in the resulting model. While this may be of little consequence for most application contexts, there are certainly applications in which such biases matter greatly, and where any kind of discrimination can have serious consequences. As the authors, we therefore caution against using the model in such contexts. Nonetheless, in keeping with an open-source mindset and the great impact it can have, we leave this responsibility with the users of the model, drawing on the many previous discussions in the open-source community.
Training Details
Training Data
See https://github.com/PEASEC/cybersecurity_dataset
Training Procedure
CySecBERT was trained with specific measures to mitigate catastrophic forgetting, so that general language understanding is largely preserved during domain adaptation. More details can be found in the paper.
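For orientation, the sketch below shows what domain-adaptive masked-language-model pretraining of a BERT checkpoint typically looks like with the transformers Trainer. It is a generic illustration, not the authors' exact procedure; the corpus file, hyperparameters, and output directory are placeholders.

```python
# Generic sketch of domain-adaptive MLM pretraining on a text corpus.
# This is NOT the authors' exact training recipe; the corpus file,
# hyperparameters, and output directory are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder: a line-per-document text file with cybersecurity texts.
dataset = load_dataset("text", data_files={"train": "cybersecurity_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cysecbert-pretraining",  # placeholder output directory
    per_device_train_batch_size=16,      # placeholder hyperparameters
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```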
Evaluation
We evaluated the model on a range of cybersecurity-specific and general NLP tasks. The details can be found in the paper.
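As a rough illustration of how such downstream evaluations are typically set up, the sketch below fine-tunes the encoder for a binary cybersecurity text-classification task. The Hub identifier, dataset files, and label set are illustrative assumptions and do not correspond to the benchmarks reported in the paper.

```python
# Sketch of fine-tuning the encoder for a binary classification task.
# The Hub ID, dataset files, and label set are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "markusbayer/CySecBERT"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder CSVs with "text" and "label" columns (0 = irrelevant, 1 = security-relevant).
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

data = data.map(tokenize, batched=True)
collator = DataCollatorWithPadding(tokenizer=tokenizer)

args = TrainingArguments(output_dir="cysecbert-cls",  # placeholder settings
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["test"],
                  data_collator=collator)
trainer.train()
print(trainer.evaluate())
```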
Citation
BibTeX:
@article{10.1145/3652594,
author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
year = {2024},
issue_date = {May 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {27},
number = {2},
issn = {2471-2566},
url = {https://doi.org/10.1145/3652594},
doi = {10.1145/3652594},
journal = {ACM Trans. Priv. Secur.},
month = {apr},
articleno = {18},
numpages = {20},
keywords = {Language model, cybersecurity BERT, cybersecurity dataset}
}
or
@misc{https://doi.org/10.48550/arxiv.2212.02974,
doi = {10.48550/ARXIV.2212.02974},
url = {https://arxiv.org/abs/2212.02974},
author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
keywords = {Cryptography and Security (cs.CR), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
Model Card Authors
Markus Bayer