markusbayer committed on
Commit 531435f
1 Parent(s): d252cd0

Update README.md

Files changed (1):
  1. README.md +46 -146
README.md CHANGED
@@ -1,206 +1,106 @@
 ---
- license: unknown
 language:
 - en
 tags:
 - Cybersecurity
 - Information Security
 - Computer Science
 ---
 # Model Card for Model ID

 <!-- Provide a quick summary of what the model is/does. -->

- This modelcard aims to be a base model for cybersecurity Tasks.

 # Model Details

- ## Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
-
-
 - **Developed by:** Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter
 - **Model type:** BERT-base
 - **Language(s) (NLP):** English
 - **Finetuned from model:** bert-base-uncased.

- ## Model Sources [optional]

 <!-- Provide the basic links for the model. -->

- - **Repository:** Will be added later
- - **Paper:** https://arxiv.org/abs/2212.02974
-
- # Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ## Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]

- ## Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- # Bias, Risks, and Limitations

 <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]
-
- ## Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]

 # Training Details

 ## Training Data

- <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

- ## Training Procedure [optional]
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- ### Preprocessing
-
- [More Information Needed]
-
- ### Speeds, Sizes, Times
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]

 # Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ## Testing Data, Factors & Metrics
-
- ### Testing Data
-
- <!-- This should link to a Data Card if possible. -->
-
- [More Information Needed]
-
- ### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- ### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]

- ## Results
-
- [More Information Needed]
-
- ### Summary
-
-
-
- # Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- # Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- # Technical Specifications [optional]
-
- ## Model Architecture and Objective
-
- [More Information Needed]
-
- ## Compute Infrastructure
-
- [More Information Needed]
-
- ### Hardware
-
- [More Information Needed]
-
- ### Software
-
- [More Information Needed]
-
- # Citation [optional]

 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

 **BibTeX:**

 @misc{https://doi.org/10.48550/arxiv.2212.02974,
   doi = {10.48550/ARXIV.2212.02974},
-
   url = {https://arxiv.org/abs/2212.02974},
-
   author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
-
   keywords = {Cryptography and Security (cs.CR), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-
   title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
-
   publisher = {arXiv},
-
   year = {2022},
-
   copyright = {arXiv.org perpetual, non-exclusive license}
 }

- **APA:**
-
- [More Information Needed]
-
- # Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- # More Information [optional]
-
- [More Information Needed]
-
 # Model Card Authors [optional]

- [More Information Needed]

 # Model Card Contact

- [More Information Needed]
-
-
 ---
+ license: apache-2.0
 language:
 - en
 tags:
 - Cybersecurity
+ - Cyber Security
 - Information Security
 - Computer Science
+ - Cyber Threats
+ - Vulnerabilities
+ - Vulnerability
+ - Malware
+ - Attacks
 ---
 # Model Card for Model ID

 <!-- Provide a quick summary of what the model is/does. -->

+ CySecBERT is a domain-adapted version of the BERT model tailored for cybersecurity tasks.
+ It is based on a [Cybersecurity Dataset](https://github.com/PEASEC/cybersecurity_dataset) consisting of 4.3 million entries from Twitter, blogs, papers, and CVEs related to the cybersecurity domain.

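As a quick-start sketch, the model can be loaded with the `transformers` library for masked-token prediction or as an encoder for sentence embeddings. This assumes the checkpoint is published on the Hugging Face Hub under an id like `markusbayer/CySecBERT` (an assumption here; verify the exact id against this repository):

```python
# Sketch: using CySecBERT for fill-mask prediction and simple sentence embeddings.
# Hub id "markusbayer/CySecBERT" is assumed; check the model repository for the
# authoritative identifier.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "markusbayer/CySecBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Masked-token prediction on a cybersecurity sentence (top 5 candidates).
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
preds = fill("The attacker used a [MASK] injection to exfiltrate the database.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))

# Mean-pooled last hidden states as a simple sentence embedding.
inputs = tokenizer("A new ransomware campaign targets VPN appliances.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model.bert(**inputs).last_hidden_state  # shape [1, seq_len, 768]
embedding = hidden.mean(dim=1).squeeze(0)            # shape [768]
```

The mean-pooling step is one common way to derive sentence vectors from a BERT encoder; task-specific fine-tuning will generally outperform raw embeddings.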
 # Model Details

 - **Developed by:** Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter
 - **Model type:** BERT-base
 - **Language(s) (NLP):** English
 - **Finetuned from model:** bert-base-uncased.

+ ## Model Sources

 <!-- Provide the basic links for the model. -->

+ - **Repository:** https://github.com/PEASEC/CySecBERT
+ - **Paper:** https://dl.acm.org/doi/abs/10.1145/3652594 and https://arxiv.org/abs/2212.02974

+ # Bias, Risks, Limitations, and Recommendations

 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+ We would like to emphasise that we did not explicitly focus on or analyse social biases in the data or the resulting model.
+ While this may matter little in most application contexts, there are certainly applications that depend heavily on such biases, where any form of discrimination can have serious consequences.
+ As authors, we caution against using the model in such contexts.
+ Nonetheless, in the spirit of open source and the great impact it can have, we leave this judgement to the users of the model, drawing on the many previous discussions in the open-source community.

 # Training Details

 ## Training Data

+ See https://github.com/PEASEC/cybersecurity_dataset

+ ## Training Procedure

+ We specifically trained CySecBERT so that it is not overly affected by catastrophic forgetting. More details can be found in the paper.

 # Evaluation

+ We evaluated the model on 15 cybersecurity-specific and general tasks, including the SuperGLUE benchmark. The details can be found in the paper.

+ # Citation

 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

 **BibTeX:**

+ @article{10.1145/3652594,
+   author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
+   title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
+   year = {2024},
+   issue_date = {May 2024},
+   publisher = {Association for Computing Machinery},
+   address = {New York, NY, USA},
+   volume = {27},
+   number = {2},
+   issn = {2471-2566},
+   url = {https://doi.org/10.1145/3652594},
+   doi = {10.1145/3652594},
+   abstract = {The field of cysec is evolving fast. Security professionals are in need of intelligence on past, current and —ideally — upcoming threats, because attacks are becoming more advanced and are increasingly targeting larger and more complex systems. Since the processing and analysis of such large amounts of information cannot be addressed manually, cysec experts rely on machine learning techniques. In the textual domain, pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) have proven to be helpful as they provide a good baseline for further fine-tuning. However, due to the domain-knowledge and the many technical terms in cysec, general language models might miss the gist of textual information. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cysec domain that can serve as a basic building block for cybersecurity systems. The model is compared on 15 tasks: Domain-dependent extrinsic tasks for measuring the performance on specific problems, intrinsic tasks for measuring the performance of the internal representations of the model, as well as general tasks from the SuperGLUE benchmark. The results of the intrinsic tasks show that our model improves the internal representation space of domain words compared with the other models. The extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model performs best in cybersecurity scenarios. In addition, we pay special attention to the choice of hyperparameters against catastrophic forgetting, as pre-trained models tend to forget the original knowledge during further training.},
+   journal = {ACM Trans. Priv. Secur.},
+   month = {apr},
+   articleno = {18},
+   numpages = {20},
+   keywords = {Language model, cybersecurity BERT, cybersecurity dataset}
+ }
+
+ or
+
 @misc{https://doi.org/10.48550/arxiv.2212.02974,
   doi = {10.48550/ARXIV.2212.02974},
   url = {https://arxiv.org/abs/2212.02974},
   author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
   keywords = {Cryptography and Security (cs.CR), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
   title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
   publisher = {arXiv},
   year = {2022},
   copyright = {arXiv.org perpetual, non-exclusive license}
 }

 # Model Card Authors [optional]

+ Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, Christian Reuter

 # Model Card Contact
