---
license: cc-by-4.0
datasets:
- FredZhang7/malicious-website-features-2.4M
widget:
- text: https://chat.openai.com/
- text: https://huggingface.co/FredZhang7/aivance-safesearch-v3
metrics:
- accuracy
language:
- af
- en
- et
- sw
- sv
- sq
- de
- ca
- hu
- da
- tl
- so
- fi
- fr
- cs
- hr
- cy
- es
- sl
- tr
- pl
- pt
- nl
- id
- sk
- lt
- 'no'
- lv
- vi
- it
- ro
- ru
- mk
- bg
- th
- ja
- ko
- multilingual
---
I'm releasing this model because v2 makes so many significant improvements over it, in dataset size, features, efficiency, robustness of feature extraction, and thoroughness, that v1 now looks simple by comparison.
The classification task for v1 is split into two stages:
1. URL features model
- **96.5%+ accurate** on training and validation data
- 2,436,727 rows of labelled URLs
 - evaluation from v2: slightly overfitted, by roughly 0.8%
2. Website features model
- **98.4% accurate** on training data, and **98.9% accurate** on validation data
- 911,180 rows of 42 features
 - evaluation from v2: slightly biased towards the URL-model output (`bert_confidence`) over the other feature columns
## Training
I applied cross-validation with `cv=5` to the training dataset to search for the best hyperparameters.
Here's the parameter grid passed to `sklearn`'s `GridSearchCV`:
```python
params = {
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```
To reproduce the 98.4% accurate model, you can follow the data analysis on the [dataset page](https://huggingface.co/datasets/FredZhang7/malicious-website-features-2.4M) to filter out the unimportant features.
Then train a LightGBM model using the best-suited hyperparameters for this task:
```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}
```
## URL Features
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
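A hedged inference sketch for the URL model. The label order and meaning are an assumption here; check the model's `config.json` (`id2label`) before relying on a particular index:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")

url = "https://chat.openai.com/"
inputs = tokenizer(url, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)  # per-class probabilities for the URL
```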
## Website Features
```bash
pip install lightgbm
```
```python
import lightgbm as lgb
booster = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
```
## Attribution
- If you distribute, remix, adapt, or build upon our work, please credit "AIstrova Technologies Inc." in your README.md, application description, research, or website.