ealvaradob committed
Commit dd4e487
1 Parent(s): 4ac6cf0

Update README.md

Files changed (1)
  1. README.md +58 -7
README.md CHANGED
@@ -3,6 +3,8 @@ license: apache-2.0
  base_model: bert-large-uncased
  tags:
  - generated_from_trainer
+ - phishing
+ - BERT
  metrics:
  - accuracy
  - precision
@@ -10,6 +12,29 @@ metrics:
  model-index:
  - name: bert-finetuned-phishing
    results: []
+ widget:
+ - text: https://www.verif22.com
+   example_title: Phishing URL
+ - text: Dear colleague, An important update about your email has exceeded your
+     storage limit. You will not be able to send or receive all of your messages.
+     We will close all older versions of our Mailbox as of Friday, June 12, 2023.
+     To activate and complete the required information click here (https://ec-ec.squarespace.com).
+     Account must be reactivated today to regenerate new space. Management Team
+   example_title: Phishing Email
+ - text: You have access to FREE Video Streaming in your plan. REGISTER with your email, password and
+     then select the monthly subscription option. https://bit.ly/3vNrU5r
+   example_title: Phishing SMS
+ - text: if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};;
+     var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
+     var sprytextfield1 = new Spry.Widget.ValidationTextField("sprytextfield1", "email");
+   example_title: Phishing Script
+ - text: Hi, this model is really accurate :)
+   example_title: Benign message
+ datasets:
+ - ealvaradob/phishing-dataset
+ language:
+ - en
+ pipeline_tag: text-classification
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -17,8 +42,10 @@ should probably proofread and complete it, then remove this comment. -->

  # bert-finetuned-phishing

- This model is a fine-tuned version of [bert-large-uncased](https://huggingface.co/bert-large-uncased) on an unknown dataset.
+ This model is a fine-tuned version of [bert-large-uncased](https://huggingface.co/bert-large-uncased) on a [phishing dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset).
+
  It achieves the following results on the evaluation set:
+
  - Loss: 0.1953
  - Accuracy: 0.9717
  - Precision: 0.9658
@@ -27,17 +54,41 @@ It achieves the following results on the evaluation set:

  ## Model description

- More information needed
+ BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion.
+ This means it was pretrained on raw texts only, with no humans labelling them in any way (which is why
+ it can use lots of publicly available data), with an automatic process generating inputs and labels from
+ those texts. More precisely, it was pretrained with two objectives:
+
+ - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the
+   input, then runs the entire masked sentence through the model and has to predict the masked words. This
+   is different from traditional recurrent neural networks (RNNs), which usually see the words one after
+   the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows
+   the model to learn a bidirectional representation of the sentence.
+
+ - Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining.
+   Sometimes they correspond to sentences that were next to each other in the original text, sometimes not.
+   The model then has to predict whether the two sentences followed each other or not.
+
+ This way, the model learns an inner representation of the English language that can then be used to
+ extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance,
+ you can train a standard classifier using the features produced by the BERT model as inputs.
+
+ This model has the following configuration:
+
+ - 24 layers
+ - 1024 hidden dimension
+ - 16 attention heads
+ - 336M parameters

  ## Intended uses & limitations

- More information needed
+ This is a BERT model fine-tuned for phishing detection in text entries.

  ## Training and evaluation data

- More information needed
-
- ## Training procedure
+ This model was fine-tuned on a phishing dataset containing samples of URLs, SMS messages, email messages,
+ and HTML code. This variety of samples broadens the model's detection range and allows it to be used in
+ various contexts.

  ### Training hyperparameters

@@ -65,4 +116,4 @@ The following hyperparameters were used during training:
  - Transformers 4.34.1
  - Pytorch 2.1.1+cu121
  - Datasets 2.14.6
- - Tokenizers 0.14.1
+ - Tokenizers 0.14.1
 
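The `pipeline_tag: text-classification` entry in the new front matter is what wires the hosted widget to the standard `transformers` text-classification pipeline, so the widget samples above can be reproduced locally. A minimal inference sketch, assuming the repository id `ealvaradob/bert-finetuned-phishing` and an environment with `transformers` and `torch` installed:

```python
# Classify two of the card's widget examples with the fine-tuned model.
from transformers import pipeline

classifier = pipeline("text-classification", model="ealvaradob/bert-finetuned-phishing")

samples = [
    "https://www.verif22.com",               # the card's phishing-URL example
    "Hi, this model is really accurate :)",  # the card's benign example
]
for text, pred in zip(samples, classifier(samples)):
    print(f"{pred['label']} ({pred['score']:.3f}): {text}")
```

Each prediction is a dict carrying the winning `label` and its softmax `score`; the label names come from the `id2label` map in the model's config.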
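Loss aside, the reported numbers are standard binary-classification metrics. The card does not include the evaluation script, but here is a sketch of how such figures are typically computed, using `scikit-learn` with hypothetical `y_true`/`y_pred` lists (1 = phishing, 0 = benign is an assumed label order):

```python
# Recompute accuracy and precision from gold labels and predictions.
from sklearn.metrics import accuracy_score, precision_score

y_true = [1, 0, 1, 1, 0]  # hypothetical gold labels
y_pred = [1, 0, 1, 0, 0]  # hypothetical model predictions
print("accuracy: ", accuracy_score(y_true, y_pred))   # fraction of correct predictions
print("precision:", precision_score(y_true, y_pred))  # of predicted phishing, how many truly are
```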
 
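The masked-language-modeling objective described in the model description can be observed directly on the base checkpoint, since `bert-large-uncased` ships with a fill-mask head. A small illustrative sketch (note this runs the base model, not the fine-tuned classifier):

```python
# Demonstrate the MLM pretraining objective on the base checkpoint.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-large-uncased")

# BERT reads the whole sentence at once and fills in [MASK] bidirectionally.
for pred in unmasker("Please click this [MASK] to verify your account.")[:3]:
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```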
 
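Since the training data is linked on the Hub, it can be pulled with the `datasets` library for inspection or re-evaluation. A sketch, assuming the default configuration of `ealvaradob/phishing-dataset` loads without extra arguments and exposes a `train` split (check the dataset card for its actual configurations):

```python
# Inspect the phishing dataset used for fine-tuning.
from datasets import load_dataset

ds = load_dataset("ealvaradob/phishing-dataset")  # assumes a loadable default config
print(ds)              # splits and column names
print(ds["train"][0])  # one labeled sample, assuming a "train" split
```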
 