question

#1
by yjas - opened

Hello,
I'm curious why URL features were extracted, since an advantage of BERT is usually that data preprocessing such as feature extraction is not necessary. Did you find that extracting URL features improved the model, or was there another reason?

Hi @yjas ,
Yes, you are right: there is no need for feature extraction from URLs. In a previous version of the model I did train on extracted features, but for this final version I trained by simply passing the URLs as they are. I did, however, extract phishing-related features from the URLs for a separate analysis in this repository.

I'm interested in testing BERT for phishing URLs only, but I'm a bit confused about the process and cannot find much documentation for it. I'm new to Hugging Face, and your repo version seems to differ from the Hugging Face one, so any guidance or confirmation of my understanding would be much appreciated. My understanding is: after collecting my URL dataset and labelling it, I do data preprocessing such as removing duplicates and checking for a balanced dataset, then tokenize the URLs following the Hugging Face guide (without any feature extraction, just passing the URL strings alone), then fine-tune the model, improve it through evaluation, and finally test/validate the performance and improve model accuracy. Would this be the correct approach?

Also, if I were to compare the performance of BERT with traditional machine learning models (which usually require feature extraction), would I have to add feature extraction to BERT as well, or can I just do the comparison without it?

Yes, that's exactly what I did. First, remove empty, duplicate, or missing values and balance the dataset; in my case I used a dataset that was already balanced. Second, split it into a training set and a test set. Third, tokenize the URLs using the BERT tokenizer; remember that each model uses its own tokenizer. Finally, evaluate the model on the test set using some metrics. I consider recall the most important one.
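A rough end-to-end sketch of those steps, under assumed names (a CSV file `urls.csv` with `url` and `label` columns, and a `bert-base-uncased` checkpoint), not the author's original training script:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# 1) Clean: drop missing values and duplicate URLs.
df = pd.read_csv("urls.csv").dropna().drop_duplicates(subset="url")

# 2) Split into training and test sets (stratified to preserve the class balance).
train_df, test_df = train_test_split(df, test_size=0.2,
                                     stratify=df["label"], random_state=42)

# 3) Tokenize with the checkpoint's own tokenizer.
checkpoint = "bert-base-uncased"  # assumption; use your model's tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["url"], truncation=True, max_length=64)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

# 4) Fine-tune and evaluate, reporting recall on the test set.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"recall": recall_score(labels, preds)}

args = TrainingArguments(output_dir="bert-phishing-urls",
                         num_train_epochs=2,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=test_ds,
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```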

To compare performance with other models, in my opinion it would not be necessary to add feature extraction to BERT. For the other models, which do not use tokenizers, you could use text vectorizers, which perform a different function but likewise focus on converting the text into a format the model can understand. If it helps as a guide, you can review the comparison I made between BERT and the following models: XGBoost, Multinomial Naive Bayes, and LSTM-CNN. There I mainly used a TF-IDF Vectorizer and TextVectorization from Keras. I hope it is useful for you.
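For instance, a classical baseline along those lines could look like the sketch below (TF-IDF plus Multinomial Naive Bayes, reusing the `train_df`/`test_df` splits from the earlier sketch). The character n-gram settings are assumptions, not the author's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import recall_score

# Character n-grams are a common choice for URLs, since URLs are not made of
# natural-language words; this range is an assumption for illustration.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(train_df["url"])
X_test = vectorizer.transform(test_df["url"])

clf = MultinomialNB()
clf.fit(X_train, train_df["label"])
preds = clf.predict(X_test)
print("Recall:", recall_score(test_df["label"], preds))
```

This keeps the comparison fair in the sense discussed above: BERT gets only the raw URL strings, while the classical model gets a generic text vectorization rather than hand-crafted phishing features.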
