ML-Based Phishing Website Detection System
This project includes code based on publicly available content and might not have been entirely written by me.
ML-Based Phishing Website Detection System (Content-Based)
This project is part of a research-based effort focused on detecting phishing websites using content-based features like HTML tags. The repository includes code for feature extraction, data collection, preparation, and building machine learning models to classify websites as phishing or legitimate.
File Inputs
- CSV files containing phishing and legitimate URLs:
  - verified_online.csv: Phishing URLs from Phishtank.org
  - tranco_list.csv: Legitimate URLs from Tranco-list.eu
Process Overview
- Load URLs from CSV files.
- Fetch content for each URL using Python’s requests library.
- Parse the content with BeautifulSoup to extract numerical features.
- Create a structured dataframe, add labels (1 for phishing, 0 for legitimate), and save the data as CSV files.
  - See: structured_data_legitimate.csv and structured_data_phishing.csv
- Split the data for training and testing or use K-fold cross-validation (K=5) as shown in the machine_learning.py script.
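The parsing and labeling steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual feature extractor: it uses Python's standard-library html.parser as a stand-in for BeautifulSoup so that it runs without extra dependencies. The tag list, the `num_*` column names, and the `extract_features` helper are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical feature set: the tags counted here are illustrative only.
TRACKED_TAGS = ("form", "input", "a", "script", "iframe")


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def extract_features(html, label):
    """Return one numeric feature row for a page, plus its class label."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    row = {f"num_{tag}": parser.counts[tag] for tag in TRACKED_TAGS}
    row["label"] = label  # 1 = phishing, 0 = legitimate
    return row


# A small inline page stands in for content fetched from a URL.
sample_html = """
<html><body>
  <form action="/login">
    <input name="user"><input name="pass" type="password">
  </form>
  <a href="https://example.com">home</a>
  <script>var x = 1;</script>
</body></html>
"""
row = extract_features(sample_html, label=1)
print(row)
```

Each row produced this way is one line of the structured dataset; collecting the rows for all URLs and writing them out gives files shaped like the two structured_data CSVs.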
Implemented Machine Learning Models
- Support Vector Machine (SVM)
- Naive Bayes
- Decision Tree
- Random Forest
- AdaBoost
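To illustrate the K-fold cross-validation (K=5) mentioned in the process overview, here is a minimal pure-Python sketch of how sample indices might be split into folds. It is an assumption-laden illustration and not the code in machine_learning.py, which may rely on a library splitter instead. The `k_fold_indices` helper and the fixed seed are hypothetical; the sample count is the dataset total reported in this README.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves as the test set once; the rest form the training set.
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(24228, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Every model in the list above would be trained and scored once per fold, and the per-fold scores averaged.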
Models are evaluated using accuracy, precision, and recall, and their performance is visualized.
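For reference, the three metrics can be computed from the confusion-matrix counts as in the sketch below. The labels follow the convention used in this project (1 for phishing, 0 for legitimate); the `classification_metrics` helper and the toy label lists are hypothetical and are not results from the actual models.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall with phishing (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy example: 4 phishing and 4 legitimate pages.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case recall (0.5) is lower than accuracy (0.625), which shows why accuracy alone can hide missed phishing pages.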
Dataset
You can create your own dataset using the data_collector.py script with a custom URL list. I used “phishtank.org” and “tranco-list.eu” as data sources.
- Total websites: 24,228
- 12,001 legitimate websites
- 12,227 phishing websites
- Dataset was created in January 2024.
- I may update the dataset every year.
Website
Visit the live demo: ML Based Phishing Detection. If the site is frozen due to inactivity, feel free to reach out to me. Visit my GitHub if you need the source code.