PhishGuard AI

The Build Logic

Executive Summary

PhishGuard AI is a machine-learning-powered web application that identifies email phishing attempts with 98.26% accuracy and a ROC-AUC of 0.9969. It combines NLP text features with a tuned Logistic Regression model in a hybrid system: the model's predictions pass through a heuristic rules engine designed to suppress false positives on legitimate automated mail.

The Problem

Pure machine learning models often struggle with enterprise notifications and weekly newsletters, incorrectly flagging them as phishing because their templated, automated language resembles phishing copy. Furthermore, traditional scanners lack the concurrency to analyze a live inbox quickly, so users face a noticeable delay.

The Solution

PhishGuard AI implements a dual-layer approach. A high-dimensional ML pipeline predicts threats, and a Reputation Heuristic Layer overrides false positives from verified domains unless critical 'panic triggers' are detected. A multi-threaded Gmail Active Scanner evaluates live inboxes in real time, showing glassmorphic threat popups and deep links directly into the user's Gmail interface.

System Architecture

ML Inference Pipeline
    Text cleaning and tokenization (stripping HTML, replacing URLs and email addresses).
    TF-IDF vectorization projecting text into a 40,000-dimensional space (unigrams and bigrams).
    Class-balanced Logistic Regression producing a phishing probability.
Heuristic Layer
    Custom rule engine that intercepts the ML score, capping standard notifications from verified senders at 12% (Low Risk) to prevent false alarms.
Backend & Orchestration Layer
    Python/Flask handling dynamic routing and OAuth2 session state.
    Concurrent scanning using ThreadPoolExecutor with 20 workers, each fetching through its own isolated API client.
Frontend & Visualization Layer
    HTML5/CSS3 with glassmorphism styling and Jinja2 templating.
    Matplotlib generating dynamic threat gauges and horizontal signal bar charts, served inline as base64 PNGs.
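The first pipeline step, text cleaning, can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the placeholder tokens `urltoken` and `emailtoken`, the regular expressions, and the function name `clean_email_text` are all assumptions, since the write-up only says that HTML is stripped and URLs and email addresses are replaced.

```python
import html
import re

# Hypothetical placeholder tokens; the source does not say what URLs/emails become.
URL_TOKEN = "urltoken"
EMAIL_TOKEN = "emailtoken"

_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
_SPACE_RE = re.compile(r"\s+")


def clean_email_text(raw: str) -> str:
    """Normalize an email body before it is handed to a vectorizer."""
    text = _TAG_RE.sub(" ", raw)                    # strip HTML tags
    text = html.unescape(text)                      # decode entities such as &amp;
    text = _URL_RE.sub(f" {URL_TOKEN} ", text)      # replace URLs first
    text = _EMAIL_RE.sub(f" {EMAIL_TOKEN} ", text)  # then bare email addresses
    return _SPACE_RE.sub(" ", text).strip().lower()


if __name__ == "__main__":
    sample = "<p>Verify at https://evil.example/login or mail help@bank.example now!</p>"
    print(clean_email_text(sample))
```

Replacing URLs and addresses with fixed tokens lets the vectorizer learn from the presence of a link or address without spending vocabulary slots on every distinct one.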

Engineering Decisions

Why a Hybrid Rules Engine?
To address the tendency of pure ML models to raise false positives on legitimate automated emails. The rules engine mimics the behavior of professional enterprise firewalls and applies a calibrated 70% confidence threshold.
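A rough sketch of how such an override might look is given below. Only the 12% cap and the 70% threshold come from this write-up. The domain allowlist, the trigger phrases, the function names, and the reading of 70% as the phishing cut-off are illustrative assumptions. How a sender becomes "verified" is not described here, so the sketch simply matches the domain of the From header.

```python
from email.utils import parseaddr

VERIFIED_DOMAINS = {"github.com", "linkedin.com"}               # hypothetical allowlist
PANIC_TRIGGERS = ("verify your password", "account suspended")  # hypothetical phrases
VERIFIED_CAP = 0.12         # cap for verified senders (from the write-up)
PHISHING_THRESHOLD = 0.70   # assumed to be the cut-off for a phishing verdict


def adjust_score(ml_probability: float, sender: str, body: str) -> float:
    """Cap the ML score for verified senders unless a panic trigger appears."""
    address = parseaddr(sender)[1].lower()
    domain = address.rpartition("@")[2]
    has_trigger = any(phrase in body.lower() for phrase in PANIC_TRIGGERS)
    if domain in VERIFIED_DOMAINS and not has_trigger:
        return min(ml_probability, VERIFIED_CAP)
    return ml_probability


def verdict(score: float) -> str:
    """Map an adjusted score to a label using the assumed threshold."""
    return "phishing" if score >= PHISHING_THRESHOLD else "legitimate"


if __name__ == "__main__":
    score = adjust_score(0.91, "GitHub <noreply@github.com>", "Your weekly digest")
    print(score, verdict(score))
```

Using `min` rather than assigning the cap outright means a low ML score is never raised by the rule; the heuristic can only reduce risk for verified senders.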

Why ThreadPoolExecutor?
To instantiate isolated Google credentials and Gmail service clients concurrently, overcoming the network latency of sequential API calls.
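The concurrency pattern can be sketched as follows. The worker count of 20 is taken from this write-up. The function `fetch_and_score` is a hypothetical stand-in that simulates network latency with a short sleep; in the real system each worker would presumably build its own credentials and Gmail client before fetching.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20  # worker count stated in the write-up


def fetch_and_score(message_id: str) -> dict:
    """Hypothetical worker: stands in for one isolated fetch plus model scoring."""
    time.sleep(0.05)  # simulated per-request network latency
    return {"id": message_id, "score": 0.0}


def scan_inbox(message_ids, worker=fetch_and_score):
    """Run the worker over all message ids concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(worker, message_ids))


if __name__ == "__main__":
    ids = [f"msg-{i}" for i in range(50)]
    started = time.perf_counter()
    results = scan_inbox(ids)
    print(len(results), "messages in", round(time.perf_counter() - started, 2), "s")
```

With 50 messages and 20 workers the requests complete in roughly three waves, so if every request takes about the same time, wall-clock time is close to three request latencies instead of fifty.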

Why TF-IDF over Deep Learning Text Models?
To keep the inference pipeline lightweight and fast enough to run locally without a GPU.
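To illustrate why this kind of model is cheap to run: once an email is vectorized, logistic regression inference is a single sparse dot product followed by a sigmoid. The sketch below is a generic illustration of that arithmetic, not the project's code; the dictionary-based sparse representation and the name `predict_proba` are assumptions made for the example.

```python
import math
from typing import Mapping


def predict_proba(features: Mapping[int, float],
                  weights: Mapping[int, float],
                  bias: float) -> float:
    """Return sigmoid(w . x + b) for a sparse feature vector.

    Only the non-zero features of the email are visited, so the cost depends on
    the number of distinct terms in the message, not on the vocabulary size.
    """
    z = bias + sum(value * weights.get(index, 0.0)
                   for index, value in features.items())
    # Numerically stable sigmoid: avoids overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


if __name__ == "__main__":
    email_vector = {17: 0.4, 2051: 0.8}   # hypothetical non-zero TF-IDF entries
    model_weights = {17: 1.5, 2051: 2.0}  # hypothetical learned coefficients
    print(predict_proba(email_vector, model_weights, bias=-1.0))
```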

Performance Metrics

Model Accuracy: 98.26% on 18,096 samples.
ROC-AUC Score: 0.9969, indicating strong class separation.
Scan Latency: Fetches and scans 50 live emails in under 2 seconds.
Feature Space: 40,000 log-scaled features.
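The phrase "log-scaled" is not expanded in this write-up. One common reading, assumed here, is sublinear term-frequency scaling combined with smoothed inverse document frequency, which is what scikit-learn's TfidfVectorizer computes when sublinear_tf is enabled. Under that assumption, the weight of a term t that occurs in document d, before L2 normalization of the document vector, would be:

```latex
w_{t,d} \;=\; \bigl(1 + \ln \mathrm{tf}_{t,d}\bigr)
          \cdot \left( \ln\frac{1 + N}{1 + \mathrm{df}_t} + 1 \right),
\qquad \mathrm{tf}_{t,d} > 0
```

Here tf is the raw count of t in d, N is the number of training documents, and df is the number of documents containing t. The logarithm dampens repetition: a term repeated 10 times contributes a frequency factor of 1 + ln 10 ≈ 3.30 rather than 10.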

Scalability Strategy

Because each worker uses its own isolated Google API client, the backend can raise the number of scanning workers to match server hardware. Models are stored as modular joblib dumps, so a new model can be loaded and swapped in without downtime.
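One way a zero-downtime model swap can work is sketched below. It is an assumption-laden illustration, not the project's implementation: the class name `ModelHolder` and its methods are hypothetical, and the loader is injected so the example runs with the standard library alone (in the real system it would presumably be `joblib.load`).

```python
import threading


class ModelHolder:
    """Holds the active model and lets it be replaced while requests are served."""

    def __init__(self, loader, path):
        self._loader = loader              # e.g. joblib.load in a real deployment
        self._lock = threading.Lock()
        self._model = loader(path)

    def current(self):
        """Return the model that is active right now."""
        with self._lock:
            return self._model

    def reload(self, path):
        """Load the new model fully, then swap the reference in one step.

        If loading fails, the exception propagates and the old model stays active.
        """
        new_model = self._loader(path)     # slow part happens outside the lock
        with self._lock:
            self._model = new_model


if __name__ == "__main__":
    def fake_loader(path):
        # Stand-in loader for demonstration only
        return f"model loaded from {path}"

    holder = ModelHolder(fake_loader, "model_v1.joblib")
    print(holder.current())
    holder.reload("model_v2.joblib")
    print(holder.current())
```

Requests that already hold a reference to the old model finish with it, while new requests pick up the replacement, so no request has to wait for the load.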

Outcome

A highly accurate phishing detection SaaS platform that balances statistical NLP classification with practical, real-world heuristic whitelisting.
