ealvaradob/bert-finetuned-phishing

@@ -3,6 +3,8 @@ license: apache-2.0
 base_model: bert-large-uncased
 tags:
 - generated_from_trainer
 metrics:
 - accuracy
 - precision
@@ -10,6 +12,29 @@ metrics:
 model-index:
 - name: bert-finetuned-phishing
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -17,8 +42,10 @@ should probably proofread and complete it, then remove this comment. -->
 # bert-finetuned-phishing
-This model is a fine-tuned version of [bert-large-uncased](https://huggingface.co/bert-large-uncased) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.1953
 - Accuracy: 0.9717
 - Precision: 0.9658
@@ -27,17 +54,41 @@ It achieves the following results on the evaluation set:
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
-## Training procedure
 ### Training hyperparameters
@@ -65,4 +116,4 @@ The following hyperparameters were used during training:
 - Transformers 4.34.1
 - Pytorch 2.1.1+cu121
 - Datasets 2.14.6
-- Tokenizers 0.14.1

 base_model: bert-large-uncased
 tags:
 - generated_from_trainer
+- phishing
+- BERT
 metrics:
 - accuracy
 - precision
 model-index:
 - name: bert-finetuned-phishing
   results: []
+widget:
+- text: https://www.verif22.com
+  example_title: Phishing URL
+- text: Dear colleague, An important update about your email has exceeded your
+    storage limit. You will not be able to send or receive all of your messages.
+    We will close all older versions of our Mailbox as of Friday, June 12, 2023.
+    To activate and complete the required information click here (https://ec-ec.squarespace.com).
+    Account must be reactivated today to regenerate new space. Management Team
+  example_title: Phishing Email
+- text: You have access to FREE Video Streaming in your plan. REGISTER with your email, password and
+    then select the monthly subscription option. https://bit.ly/3vNrU5r
+  example_title: Phishing SMS
+- text: if(data.selectedIndex > 0){$('#hidCflag').val(data.selectedData.value);};;
+    var sprypassword1 = new Spry.Widget.ValidationPassword("sprypassword1");
+    var sprytextfield1 = new Spry.Widget.ValidationTextField("sprytextfield1", "email");
+  example_title: Phishing Script
+- text: Hi, this model is really accurate :)
+  example_title: Benign message
+datasets:
+- ealvaradob/phishing-dataset
+language:
+- en
+pipeline_tag: text-classification
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 # bert-finetuned-phishing
+This model is a fine-tuned version of [bert-large-uncased](https://huggingface.co/bert-large-uncased) on a [phishing dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset).
 It achieves the following results on the evaluation set:
 - Loss: 0.1953
 - Accuracy: 0.9717
 - Precision: 0.9658
 ## Model description
+BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion.
+This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why
+it can use lots of publicly available data) with an automatic process to generate inputs and labels from
+those texts. More precisely, it was pretrained with two objectives:
+- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input,
+  then runs the entire masked sentence through the model and has to predict the masked words. This is different
+  from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from
+  autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a
+  bidirectional representation of the sentence.
+- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining.
+  Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The
+  model then has to predict if the two sentences were following each other or not.
+This way, the model learns an inner representation of the English language that can then be used to extract
+features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a
+standard classifier using the features produced by the BERT model as inputs.
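
The masking step of the MLM objective described above can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not part of this model card: it splits on whitespace instead of using BERT's WordPiece tokenizer, and it always substitutes `[MASK]`, whereas the original BERT recipe replaces a selected token with `[MASK]` only 80% of the time (10% random token, 10% unchanged). The function name `mask_tokens` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token where position i was masked,
    and None everywhere else (positions the model is not asked to predict).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Hide the token and remember it as the prediction target.
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Illustrative input only; whitespace splitting stands in for real tokenization.
sentence = "your account has been suspended please verify your password".split()
masked, labels = mask_tokens(sentence, seed=0)
print(masked)
print(labels)
```

Because each position is masked independently, the fraction of masked tokens in a short sentence can differ noticeably from 15%; the rate only holds on average over a large corpus.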
+This model has the following configuration:
+- 24 layers
+- 1024 hidden dimension
+- 16 attention heads
+- 336M parameters
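
As a rough sanity check on the parameter figure, the count can be reproduced from the configuration with plain arithmetic. The sketch below assumes values that this card does not state and that are taken from the standard `bert-large-uncased` configuration: a vocabulary of 30522 tokens, 512 positions, 2 token types, and a feed-forward size of 4096. The function name is hypothetical, and the fine-tuned classification head is not included.

```python
def bert_encoder_param_count(vocab_size=30522, hidden=1024, layers=24,
                             intermediate=4096, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder plus pooler.

    Defaults other than hidden size and layer count are assumptions based on
    the standard bert-large-uncased configuration, not on this model card.
    """
    layer_norm = 2 * hidden  # scale and bias vectors

    # Word, position, and token-type embeddings followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections, then one LayerNorm.
    # The number of attention heads splits these matrices but adds no parameters.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (up- and down-projection), then one LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    # Dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_encoder_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total comes out at roughly 335 million, in the same range as the 336M listed above; the small classification layer added during fine-tuning accounts for only a few thousand more.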
 ## Intended uses & limitations
+This is a BERT model fine-tuned for phishing detection in text entries.
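
A minimal usage sketch, assuming the model is published on the Hub as `ealvaradob/bert-finetuned-phishing` and that the `transformers` library is installed. The helper names `load_classifier` and `classify` are hypothetical. The label strings returned depend on the model's own configuration and are not documented in this card, so the sketch passes them through unchanged.

```python
MODEL_ID = "ealvaradob/bert-finetuned-phishing"  # assumed Hub repository id


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def classify(texts, classifier=None):
    """Return a (label, score) pair for each input text."""
    if classifier is None:
        classifier = load_classifier()
    # truncation=True cuts inputs that exceed BERT's 512-token limit.
    results = classifier(list(texts), truncation=True)
    return [(r["label"], r["score"]) for r in results]
```

Calling `classify(["https://www.verif22.com", "Hi, this model is really accurate :)"])` would score two of the widget examples from the metadata above. Long inputs such as full HTML pages are truncated, so content past the token limit does not influence the prediction.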
 ## Training and evaluation data
+This model was fine-tuned on a phishing dataset containing samples of URLs, SMS messages, email messages,
+and HTML code. This sample variability broadens the detection range of the model and allows it to be used in
+various contexts.
 ### Training hyperparameters
 - Transformers 4.34.1
 - Pytorch 2.1.1+cu121
 - Datasets 2.14.6
+- Tokenizers 0.14.1