markusbayer committed on
Commit 531435f
1 Parent(s): d252cd0

Update README.md

Files changed (1):
  1. README.md +46 -146
README.md CHANGED
@@ -1,206 +1,106 @@
 ---
- license: unknown
 language:
 - en
 tags:
 - Cybersecurity
 - Information Security
 - Computer Science
 ---
 # Model Card for Model ID

 <!-- Provide a quick summary of what the model is/does. -->

- This modelcard aims to be a base model for cybersecurity Tasks.

 # Model Details

- ## Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
-
-
 - **Developed by:** Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter
 - **Model type:** BERT-base
 - **Language(s) (NLP):** English
 - **Finetuned from model:** bert-base-uncased.

- ## Model Sources [optional]

 <!-- Provide the basic links for the model. -->

- - **Repository:** Will be added later
- - **Paper:** https://arxiv.org/abs/2212.02974
-
- # Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ## Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]

- ## Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- # Bias, Risks, and Limitations

 <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]
-
- ## Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]

 # Training Details

 ## Training Data

- <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

- ## Training Procedure [optional]
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- ### Preprocessing
-
- [More Information Needed]
-
- ### Speeds, Sizes, Times
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]

 # Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ## Testing Data, Factors & Metrics
-
- ### Testing Data
-
- <!-- This should link to a Data Card if possible. -->
-
- [More Information Needed]
-
- ### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- ### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]

- ## Results
-
- [More Information Needed]
-
- ### Summary
-
-
-
- # Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- # Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- # Technical Specifications [optional]
-
- ## Model Architecture and Objective
-
- [More Information Needed]
-
- ## Compute Infrastructure
-
- [More Information Needed]
-
- ### Hardware
-
- [More Information Needed]
-
- ### Software
-
- [More Information Needed]
-
- # Citation [optional]

 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

 **BibTeX:**

 @misc{https://doi.org/10.48550/arxiv.2212.02974,
   doi = {10.48550/ARXIV.2212.02974},
-
   url = {https://arxiv.org/abs/2212.02974},
-
   author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
-
   keywords = {Cryptography and Security (cs.CR), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-
   title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
-
   publisher = {arXiv},
-
   year = {2022},
-
   copyright = {arXiv.org perpetual, non-exclusive license}
 }

- **APA:**
-
- [More Information Needed]
-
- # Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- # More Information [optional]
-
- [More Information Needed]
-
 # Model Card Authors [optional]

- [More Information Needed]

 # Model Card Contact

- [More Information Needed]
-
-
 ---
+ license: apache-2.0
 language:
 - en
 tags:
 - Cybersecurity
+ - Cyber Security
 - Information Security
 - Computer Science
+ - Cyber Threats
+ - Vulnerabilities
+ - Vulnerability
+ - Malware
+ - Attacks
 ---
 # Model Card for Model ID

 <!-- Provide a quick summary of what the model is/does. -->

+ CySecBERT is a domain-adapted version of the BERT model tailored for cybersecurity tasks.
+ It is based on a [Cybersecurity Dataset](https://github.com/PEASEC/cybersecurity_dataset) consisting of 4.3 million entries from Twitter, blogs, papers, and CVEs related to the cybersecurity domain.

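As a quick-start sketch, the model can be loaded with the `transformers` library for masked-token prediction or as an encoder for sentence embeddings. This assumes the checkpoint is published on the Hugging Face Hub under an id like `markusbayer/CySecBERT` (an assumption here; verify the exact id against this repository):

```python
# Sketch: using CySecBERT for fill-mask prediction and simple sentence embeddings.
# Hub id "markusbayer/CySecBERT" is assumed; check the model repository for the
# authoritative identifier.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "markusbayer/CySecBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Masked-token prediction on a cybersecurity sentence (top 5 candidates).
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
preds = fill("The attacker used a [MASK] injection to exfiltrate the database.")
for p in preds:
    print(p["token_str"], round(p["score"], 3))

# Mean-pooled last hidden states as a simple sentence embedding.
inputs = tokenizer("A new ransomware campaign targets VPN appliances.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model.bert(**inputs).last_hidden_state  # shape [1, seq_len, 768]
embedding = hidden.mean(dim=1).squeeze(0)            # shape [768]
```

The mean-pooling step is one common way to derive sentence vectors from a BERT encoder; task-specific fine-tuning will generally outperform raw embeddings.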
 # Model Details

 - **Developed by:** Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter
 - **Model type:** BERT-base
 - **Language(s) (NLP):** English
 - **Finetuned from model:** bert-base-uncased.

+ ## Model Sources

 <!-- Provide the basic links for the model. -->

+ - **Repository:** https://github.com/PEASEC/CySecBERT
+ - **Paper:** https://dl.acm.org/doi/abs/10.1145/3652594 and https://arxiv.org/abs/2212.02974

+ # Bias, Risks, Limitations, and Recommendations

 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+ We would like to emphasise that we did not explicitly focus on or analyse social biases in the data or the resulting model.
+ While this may matter little in most application contexts, there are certainly applications that depend heavily on such biases, where any form of discrimination can have serious consequences.
+ As authors, we caution against using the model in such contexts.
+ Nonetheless, in the spirit of open source and the great impact it can have, we leave this judgement to the users of the model, drawing on the many previous discussions in the open-source community.

 # Training Details

 ## Training Data

+ See https://github.com/PEASEC/cybersecurity_dataset

+ ## Training Procedure

+ We specifically trained CySecBERT so that it is not overly affected by catastrophic forgetting. More details can be found in the paper.

 # Evaluation

+ We evaluated the model on 15 cybersecurity-specific and general tasks, including the SuperGLUE benchmark. The details can be found in the paper.

+ # Citation

 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

 **BibTeX:**

+ @article{10.1145/3652594,
+   author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
+   title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
+   year = {2024},
+   issue_date = {May 2024},
+   publisher = {Association for Computing Machinery},
+   address = {New York, NY, USA},
+   volume = {27},
+   number = {2},
+   issn = {2471-2566},
+   url = {https://doi.org/10.1145/3652594},
+   doi = {10.1145/3652594},
+   abstract = {The field of cysec is evolving fast. Security professionals are in need of intelligence on past, current and —ideally — upcoming threats, because attacks are becoming more advanced and are increasingly targeting larger and more complex systems. Since the processing and analysis of such large amounts of information cannot be addressed manually, cysec experts rely on machine learning techniques. In the textual domain, pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) have proven to be helpful as they provide a good baseline for further fine-tuning. However, due to the domain-knowledge and the many technical terms in cysec, general language models might miss the gist of textual information. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cysec domain that can serve as a basic building block for cybersecurity systems. The model is compared on 15 tasks: Domain-dependent extrinsic tasks for measuring the performance on specific problems, intrinsic tasks for measuring the performance of the internal representations of the model, as well as general tasks from the SuperGLUE benchmark. The results of the intrinsic tasks show that our model improves the internal representation space of domain words compared with the other models. The extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model performs best in cybersecurity scenarios. In addition, we pay special attention to the choice of hyperparameters against catastrophic forgetting, as pre-trained models tend to forget the original knowledge during further training.},
+   journal = {ACM Trans. Priv. Secur.},
+   month = {apr},
+   articleno = {18},
+   numpages = {20},
+   keywords = {Language model, cybersecurity BERT, cybersecurity dataset}
+ }
+
+ or
+
 @misc{https://doi.org/10.48550/arxiv.2212.02974,
   doi = {10.48550/ARXIV.2212.02974},
   url = {https://arxiv.org/abs/2212.02974},
   author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
   keywords = {Cryptography and Security (cs.CR), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
   title = {CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain},
   publisher = {arXiv},
   year = {2022},
   copyright = {arXiv.org perpetual, non-exclusive license}
 }

 # Model Card Authors [optional]

+ Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, Christian Reuter

 # Model Card Contact
