ML-Based Phishing Website Detection System

This project includes code based on publicly available content and might not have been entirely written by me.

ML-Based Phishing Website Detection System (Content-Based)

This project is part of a research-based effort focused on detecting phishing websites using content-based features like HTML tags. The repository includes code for feature extraction, data collection, preparation, and building machine learning models to classify websites as phishing or legitimate.

File Inputs

  • CSV files containing phishing and legitimate URLs:
    • verified_online.csv - Phishing URLs from Phishtank.org
    • tranco_list.csv - Legitimate URLs from Tranco-list.eu

Process Overview

  1. Load URLs from CSV files.
  2. Fetch content for each URL using Python’s requests library.
  3. Parse the content with BeautifulSoup to extract numerical features.
  4. Create a structured dataframe, add labels (1 for phishing, 0 for legitimate), and save the data as CSV files.
    • See: structured_data_legitimate.csv and structured_data_phishing.csv
  5. Split the data for training and testing or use K-fold cross-validation (K=5) as shown in the machine_learning.py script.
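As a rough illustration of steps 3 and 4, the sketch below parses an HTML string with BeautifulSoup and turns it into a row of numerical features plus a label. The function name `extract_features` and the specific features are assumptions chosen for illustration; they are not necessarily the features used in this project. Fetching is left out so the example runs offline; in practice the HTML would come from something like `requests.get(url, timeout=5).text`.

```python
from bs4 import BeautifulSoup


def extract_features(html: str) -> dict:
    """Count a few HTML tags as numerical features (illustrative feature set)."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "number_of_inputs": len(soup.find_all("input")),
        "number_of_links": len(soup.find_all("a")),
        "number_of_scripts": len(soup.find_all("script")),
        # Binary features: 1 if the element is present, 0 otherwise
        "has_password_field": int(soup.find("input", {"type": "password"}) is not None),
        "has_title": int(soup.title is not None),
    }


# Hypothetical page content standing in for a fetched website
sample_html = """
<html><head><title>Login</title><script></script></head>
<body><form><input type="text"><input type="password"></form>
<a href="#">Help</a></body></html>
"""

features = extract_features(sample_html)
features["label"] = 1  # 1 = phishing, 0 = legitimate
print(features)
```

Each such dictionary can become one row of the structured dataframe before it is saved to CSV.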

Implemented Machine Learning Models

  • Support Vector Machine (SVM)
  • Naive Bayes
  • Decision Tree
  • Random Forest
  • AdaBoost

The models are evaluated using accuracy, precision, and recall, and their performance is visualized.
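A minimal sketch of how these five models could be compared with 5-fold cross-validation in scikit-learn is shown below. It is not the code from machine_learning.py: the data is synthetic (generated with `make_classification`), the hyperparameters are mostly defaults, and `GaussianNB` is an assumed choice of Naive Bayes variant.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted features; not the project's dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),  # assumed variant
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

metrics = ("accuracy", "precision", "recall")
results = {}
for name, model in models.items():
    # cv=5 uses stratified 5-fold splitting for classifiers
    scores = cross_validate(model, X, y, cv=5, scoring=metrics)
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

The `results` dictionary holds the mean score per metric for each model, which is a convenient shape for plotting a comparison chart.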

Dataset

You can create your own dataset using the data_collector.py script with a custom URL list. I used “phishtank.org” and “tranco-list.eu” as data sources.

  • Total websites: 24,228
    • 12,001 legitimate websites
    • 12,227 phishing websites
  • The dataset was created in January 2024.
  • I may update the dataset every year.

Website

Visit the live demo: ML Based Phishing Detection. If the site is frozen due to a lack of active users, feel free to reach out to me. Visit my GitHub if you need the source code.

This post is licensed under CC BY 4.0 by the author.