Production-Ready Phishing Detection System with Explainable AI
A complete machine learning operations (ML Ops) system demonstrating the full lifecycle from exploratory data analysis through production deployment. Built to showcase end-to-end capabilities: rigorous feature engineering, model development with isotonic calibration, multi-tier decision frameworks, SHAP explainability, and production-grade validation.
Model Performance:
- 99.87% PR-AUC - Near-perfect precision-recall tradeoff
- 0.24% False Positive Rate - Only 66 misclassifications out of 26,970 legitimate URLs
- 99.40% F1-Macro - Excellent balance across both classes
- Brier Score: 0.0052 - Well-calibrated probabilities for threshold-based decisions
System Capabilities:
- 88% High Automation Rate - High-confidence decisions handled by policy bands (52% ALLOW, 36% BLOCK)
- 12% Gray Zone - Uncertain cases escalated to judge with explainable rationale
- 7 URL-Only Features - No page fetching required (<50ms inference)
- SHAP Explainability - Feature-level attribution for regulatory compliance
- LLM Judge Integration - Ollama-powered reasoning for edge cases (npm.org, bit.ly)
- Production-Grade Validation - Great Expectations data contracts in CI/CD
Critical Design Decision:
- Eliminated IsHTTPS Feature - Chose the 7-feature model over the 8-feature model, accepting 109 additional false negatives, in order to eliminate the 8-feature model's 100% miss rate on HTTPS phishing (a systematic vulnerability caused by distribution shift)
Table of Contents:
- System Architecture
- Discovery Phase: Exploratory Data Analysis
- Feature Engineering
- Model Development
- Critical Design Decision: IsHTTPS Removal
- Threshold Optimization
- Production System Design
- LLM Judge System
- Validation & Quality Assurance
- API Reference
- Key Learnings
- Quick Start
- Documentation
┌─────────────────────────────────────────┐
│ CLIENT REQUEST │
│ (URL to analyze) │
└──────────────────┬──────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ GATEWAY SERVICE (:8000) │
│ ┌────────────────────────────────────┐ │
│ │ 1. Whitelist Check (15 domains) │ │
│ │ ├─ google.com, github.com │ │
│ │ └─ Fast-path ALLOW (O(1), <10ms)│ │
│ └────────────────┬───────────────────┘ │
│ │ │
│ ┌────────────────▼───────────────────┐ │
│ │ 2. Model Service Call │ │
│ │ └─ Returns p_malicious + 7 feats│ │
│ └────────────────┬───────────────────┘ │
│ │ │
│ ┌────────────────▼───────────────────┐ │
│ │ 3. Enhanced Routing Logic │ │
│ │ ├─ Short domain check (≤12) │ │
│ │ └─ Routes edge cases to judge │ │
│ └────────────────┬───────────────────┘ │
│ │ │
│ ┌────────────────▼───────────────────┐ │
│ │ 4. Policy Bands │ │
│ │ ├─ p < 0.011 → ALLOW │ │
│ │ ├─ p > 0.998 → BLOCK │ │
│ │ └─ 0.011 ≤ p ≤ 0.998 → REVIEW │ │
│ └────────────────┬───────────────────┘ │
│ │ │
│ ┌────────────────▼───────────────────┐ │
│ │ 5. Judge Escalation (Gray Zone) │ │
│ │ ├─ LLM Judge (Ollama primary) │ │
│ │ └─ Stub Judge (deterministic) │ │
│ │ (fallback if LLM timeout) │ │
│ └────────────────┬───────────────────┘ │
└──────────────────┬────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ MODEL SERVICE (:8002) │
│ ┌────────────────────────────────────┐ │
│ │ • Feature Extraction (7 features) │ │
│ │ • XGBoost Inference │ │
│ │ • Isotonic Calibration │ │
│ │ • SHAP Explainability (/explain) │ │
│ └────────────────────────────────────┘ │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ OLLAMA LLM (:11434) │
│ • llama3.2:1b model │
│ • 60-second timeout (first call) │
│ • Provides human-readable rationale │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ RESPONSE TO CLIENT │
│ {decision, p_malicious, reason, │
│ judge_rationale, shap_values} │
└──────────────────────────────────────────┘
Decision Flow Examples:
- github.com → Whitelist → ALLOW (bypasses model, <10ms)
- phishing.top → Model (p=1.0) → Policy Band → BLOCK
- npm.org → Model (p=0.35) → Enhanced Routing → Judge → ALLOW (short domain edge case)
- example.com → Model (p=0.05) → Policy Band → ALLOW
PhiUSIIL Phishing URL Dataset (2019-2020)
- Source: Prasad & Chandra (2023)
- Size: 235,795 URLs → 235,370 after deduplication
- Class Distribution: 57.3% legitimate (134,850) | 42.7% phishing (100,945)
- Quality: Zero missing values, excellent data integrity
Duplicate Removal:
- Found: 425 duplicate URLs (0.18% of dataset)
- Action: Removed to prevent train/test leakage
- Verification: All duplicates had consistent labels
- Final Dataset: 235,370 unique URLs
Rationale: Standard ML pipeline step to ensure model learns patterns, not specific URLs.
We systematically evaluated all 54 features in the dataset to identify the optimal subset for URL-only phishing detection.
Goal: Select features extractable from URL alone (no page fetching)
Identified: 25 URL-only features across three categories:
- Structure (7): URLLength, DomainLength, IsDomainIP, TLD, TLDLength, NoOfSubDomain, IsHTTPS
- Character Patterns (15): NoOfLettersInURL, LetterRatioInURL, NoOfDigitsInURL, CharContinuationRate, URLCharProb, etc.
- Behavioral (3): Bank, Pay, Crypto (keyword presence)
Excluded: Page-content features (HTMLLineOfCode, NoOfImages, JSObfuscation) that require fetching
Methodology:
separation_score = |median_phishing - median_legitimate| / pooled_std_dev
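As a minimal sketch (assuming a pandas DataFrame `df` with one row per URL, the candidate feature columns, and a `label` column where 1 = legitimate and 0 = phishing, and taking the pooled standard deviation as the average of the two class variances), the score could be computed like this:

```python
import numpy as np
import pandas as pd

def separation_score(df: pd.DataFrame, feature: str, label_col: str = "label") -> float:
    """Absolute median gap between classes, scaled by a pooled standard deviation.
    Assumes label == 1 for legitimate and label == 0 for phishing URLs."""
    legit = df.loc[df[label_col] == 1, feature]
    phish = df.loc[df[label_col] == 0, feature]
    pooled_std = np.sqrt((legit.var(ddof=1) + phish.var(ddof=1)) / 2)
    return abs(phish.median() - legit.median()) / pooled_std

# Example: rank all candidate URL-only features by separation
# scores = {f: separation_score(df, f) for f in url_only_features}
```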
Tier Classification:
- TIER 1 (Must-Have): separation > 1.3 → 4 features
- TIER 2 (Strong): 0.8 < separation ≤ 1.3 → 2 features
- TIER 3 (Moderate): 0.3 < separation ≤ 0.8 → 2 features
- TIER 4 (Weak): separation ≤ 0.3 → 16 features (excluded)
Top Features by Separation Score:
| Feature | Separation | Tier | Interpretation |
|---|---|---|---|
| IsHTTPS | 2.829 | 1 | Phishing uses HTTP 50.9% of time; Legitimate uses HTTPS 100% |
| TLDLegitimateProb | 2.012 | 1 | TLD reputation score (Bayesian priors) |
| CharContinuationRate | 1.372 | 1 | Character repetition (aaaa, 1111) |
| SpacialCharRatioInURL | 1.330 | 1 | Special character density |
| URLCharProb | 0.889 | 2 | Character distribution entropy |
| LetterRatioInURL | 0.825 | 2 | Letter density |
| NoOfOtherSpecialCharsInURL | 0.562 | 3 | Special character count |
| DomainLength | 0.324 | 3 | Domain name length |
Initial Selection: 8 features → Critical discovery: IsHTTPS, despite being the strongest single discriminator, showed distribution shift (see Critical Design Decision: IsHTTPS Removal)
Figure 1: Feature separation analysis showing discriminative power of each URL-only feature
High Correlation Pairs (|r| > 0.8):
- URLLength ↔ NoOfLettersInURL: 0.96 (strong redundancy)
- URLLength ↔ NoOfDigitsInURL: 0.84 (redundant)
- NoOfDigitsInURL ↔ NoOfEqualsInURL: 0.81 (query param correlation)
Action: Excluded URLLength and kept DomainLength, which is more specific and less correlated with the other features.
Moderate Correlations (0.6 < |r| < 0.8):
- SpacialCharRatioInURL ↔ CharContinuationRate: -0.71 (inverse relationship)
- DigitRatioInURL ↔ URLCharProb: -0.71
Action: Both included as they capture different aspects (density vs. entropy).
Figure 2: Correlation matrix showing feature redundancy patterns
TLD Analysis:
- High-Risk TLDs: .top (99.9% phishing), .dev (98.6%), .app (97.8%)
- Legitimate TLDs: .edu (0.3% phishing), .org (12.1% phishing)
- Decision: Use the Wilson confidence interval (a robust method for binomial proportions) to measure uncertainty, especially for TLDs with few samples.
- For a range of minimum sample thresholds, the analysis:
  - Filters out TLDs with too few URLs (not enough data to trust their stats).
  - Computes the observed legitimacy rate and its confidence interval width for each remaining TLD.
  - Summarizes the average and 90th percentile uncertainty (CI width) across TLDs.
  - Assigns a reliability label (HIGH, MEDIUM, LOW) based on how tight the confidence intervals are.
- This lets us balance coverage (how many URLs/TLDs we keep) against reliability (how much we can trust the statistics) and pick a defensible threshold for model training and evaluation.
CI_width ≈ 2 * sqrt(p(1-p)/n + 1/(4n²))
where p ≈ 0.5 (worst case) and n = sample size
95% confidence interval = (p - half-width, p + half-width)
Hyperparameters:
- α (pseudo-legitimate): 1
- β (pseudo-phishing): 2 (conservative: unknown or under-sampled TLDs are treated as risky until proven safe)
- MIN_SAMPLES: 20
- All single-sample TLDs now use global rate (0.574)
- Only TLDs with ≥20 samples get custom probabilities
Domain Analysis:
- Cardinality: 220,086 unique domains (1.07 URLs per domain)
- Generalization: Only 0.025% of domains appear in both classes
- Bias: Major domains (docs.google.com, s3.amazonaws.com) show 100% phishing due to dataset sampling
- Decision: EXCLUDE Domain feature (memorization risk, data leakage)
OPTIMAL 7-FEATURE SET:
- TLDLegitimateProb (separation: 2.012)
- CharContinuationRate (separation: 1.372)
- SpacialCharRatioInURL (separation: 1.330)
- URLCharProb (separation: 0.889)
- LetterRatioInURL (separation: 0.825)
- NoOfOtherSpecialCharsInURL (separation: 0.562)
- DomainLength (separation: 0.324)
Selection Criteria:
- ✅ Prioritized Tier 1-3 features (separation > 0.3)
- ✅ Excluded highly correlated pairs
- ✅ Balanced predictive power with interpretability
- ✅ All features extractable from URL string alone
Comparison to Initial Selection:
| Criteria | Initial 7 Features | Optimal 7 Features |
|---|---|---|
| Tier 1 features | 4 | 3 ✅ |
| Tier 2 features | 1 | 2 ✅ |
| Tier 3 features | 1 | 2 ✅ |
| Tier 4 features (weak) | 3 ❌ | 0 ✅ |
Improvement: Replaced 3 weak features (URLLength, DigitRatioInURL, NoOfSubDomain) with stronger discriminators.
All features are extracted using src/common/feature_extraction.py to ensure training/serving consistency.
- Type: Float [0.0, 1.0]
- Definition: Bayesian priors using Wilson Confidence Interval legitimacy probability for top-level domain
- Source: common/tld_probs.json (695 TLDs with confidence values)
- Methodology: Statistically justified confidence values
Statistical Justification:
We conducted a rigorous analysis to determine optimal parameters for TLD probability estimation:
Sample Distribution Analysis:
- Total TLDs: 1,401 unique
- Median samples per TLD: 3 (50th percentile)
- 95th percentile: 249 samples
- Sample distribution:
- 34.4% of TLDs have only 1 sample (unreliable)
- 70.4% of TLDs have < 10 samples (poor statistical confidence)
Wilson Confidence Interval Analysis:
| Sample Size | CI Width | Reliability | Samples % |
|---|---|---|---|
| 5 | 0.525 | UNRELIABLE (very wide) | 99.4% |
| 10 | 0.401 | POOR (wide) | 98.9% |
| 15 | 0.349 | POOR (wide) | 98.5% |
| 20 | 0.286 | GOOD (narrow) | 98.2% |
| 30 | 0.241 | GOOD (narrow) | 97.5% |
| 50 | 0.093 | EXCELLENT (very narrow) | 96.7% |
Final Parameters (Data-Driven):
- MIN_SAMPLES = 20 (covers 98.2% of URLs, balances reliability vs coverage)
- ALPHA = 1, BETA = 2 (security-first priors: "unknown TLDs are risky until proven safe")
Methodology:
- TLDs with ≥20 samples: Smoothed Bayesian estimation with (α=1, β=2) priors
- TLDs with <20 samples: Fallback to global legitimacy rate (0.574)
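A minimal sketch of one plausible reading of this estimator (the exact code that built common/tld_probs.json may differ):

```python
ALPHA, BETA = 1, 2          # pseudo-legitimate / pseudo-phishing counts (security-first prior)
MIN_SAMPLES = 20            # below this, fall back to the global legitimacy rate
GLOBAL_LEGIT_RATE = 0.574   # overall share of legitimate URLs in the dataset

def tld_legitimate_prob(n_legit: int, n_total: int) -> float:
    """Smoothed Bayesian estimate of P(legitimate | TLD), risky prior for rare TLDs."""
    if n_total < MIN_SAMPLES:
        return GLOBAL_LEGIT_RATE
    return (n_legit + ALPHA) / (n_total + ALPHA + BETA)

# e.g. a TLD seen 25 times, 20 of them legitimate:
# tld_legitimate_prob(20, 25) -> (20 + 1) / (25 + 3) = 0.75
```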
Example TLD Probabilities:
- .com: 0.611 (balanced)
- .org: 0.709 (high legitimacy)
- .edu: ~0.99 (very high legitimacy)
- .tk (Tokelau): 0.019 (high phishing)
- .top: 0.002 (very high phishing)
- Type: Float [0.0, 1.0]
- Formula: (count of repeated chars) / (total chars - 1)
- Examples:
- "abc" → 0.0 (no repetition)
- "aaa" → 1.0 (all repeated)
- "google.com" → 0.176 (some repetition in "oo")
- Type: Float [0.0, 1.0]
- Definition: Density of special characters
- Special chars: ! @ # $ % ^ & * ( ) _ + - = [ ] { } | ; : , . < > ? /
- Examples:
- "http://example.com" → 0.16
- "http://ex.com/login?id=123&token=abc" → 0.23
- Type: Float [0.0, 1.0]
- Definition: Proportion of common URL characters (alphanumeric plus :/.?=&-_)
- Purpose: Measures how "URL-like" the character distribution is
- Examples:
- "http://example.com" → 0.95 (all common chars)
- "http://ex.com/@@##$$" → 0.70 (unusual chars)
- Type: Float [0.0, 1.0]
- Formula: (count of letters A-Za-z) / (total chars)
- Examples:
- "http://example.com" → 0.63
- "http://ex.com/123" → 0.47
- Type: Integer [0, ∞)
- Definition: Total count of special characters
- Same character set as SpacialCharRatioInURL (but returns count instead of ratio)
- Examples:
- Type: Integer [1, 253]
- Definition: Length of the domain component (netloc)
- RFC 1035 limit: 253 characters
- Examples:
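To make the definitions above concrete, here is a hedged sketch of three of the simpler features; the authoritative implementations live in src/common/feature_extraction.py and may differ in detail (for example, in exactly which characters count as special):

```python
from urllib.parse import urlparse

def domain_length(url: str) -> int:
    """Length of the domain component (netloc)."""
    return len(urlparse(url).netloc)

def letter_ratio(url: str) -> float:
    """Proportion of letter characters in the full URL string."""
    return sum(c.isalpha() for c in url) / len(url) if url else 0.0

def char_continuation_rate(url: str) -> float:
    """Share of adjacent character pairs that repeat: 'abc' -> 0.0, 'aaa' -> 1.0."""
    if len(url) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(url, url[1:]) if a == b)
    return repeats / (len(url) - 1)
```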
One of the most critical lessons from this project was ensuring feature extraction consistency between training and production.
The Problem: Initial deployment surfaced training/serving skew once we extracted features with our own code. Root cause analysis revealed:
- Training notebook used PhiUSIIL's pre-computed features (black-box calculations)
- Production service extracted features using custom logic
- Small implementation differences led to vastly different feature values
The Solution:
Created a shared feature extraction library (src/common/feature_extraction.py) used by BOTH training and serving:
Training Pipeline:
notebooks/feature_engineering.ipynb
└─ Uses src/common/feature_extraction.py
└─ Generates phiusiil_features_v2.csv (8 features)
└─ notebooks/02_ablation_url_only.ipynb trains final model
Serving Pipeline:
Gateway Service → Model Service
└─ Uses src/common/feature_extraction.py (same code!)
└─ Extracts features at inference time
Validation:
- ✅ Batch extraction matches live extraction for 1000 random URLs
- ✅ Deterministic: Same URL always produces same features
- ✅ No data leakage: Training uses only raw PhiUSIIL URLs
Key Takeaway: Feature extraction is code, not notebooks. Always use shared libraries to prevent training/serving skew.
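A minimal sketch of the kind of parity check described above (the import, function name, and CSV column are illustrative assumptions):

```python
import pandas as pd
from src.common.feature_extraction import extract_features  # assumed shared entry point

def check_training_serving_parity(features_csv: str, n: int = 1000) -> None:
    """Re-extract features live for n random URLs and compare them to the batch CSV."""
    batch = pd.read_csv(features_csv)
    sample = batch.sample(n, random_state=42)
    for _, row in sample.iterrows():
        live = extract_features(row["URL"])  # dict: feature name -> value (assumed interface)
        for name, value in live.items():
            assert abs(value - row[name]) < 1e-9, f"Skew on {name} for {row['URL']}"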
Strategy: 80/20 train/validation split with stratification
- Training Set: 188,296 URLs (80%)
- Validation Set: 47,074 URLs (20%)
- Stratification: Maintains class balance in both sets (57.3% legitimate, 42.7% phishing)
- Random Seed: 42 (reproducibility)
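A minimal sketch of this split, assuming X holds the 7 feature columns and y the 0/1 labels:

```python
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.20,     # 80/20 split
    stratify=y,         # preserve the 57.3% / 42.7% class balance in both folds
    random_state=42,    # reproducibility
)
```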
Why no separate test set?
- Dataset size (235K URLs) makes 3-way split inefficient
- Validation fold serves dual purpose: calibration + final evaluation
- Cross-validation used during model selection (not reported here)
Two candidates were evaluated.
Candidate 1 (baseline):
- Preprocessing: StandardScaler for feature normalization
- Class Balancing: class_weight="balanced" to handle slight imbalance
- Regularization: L2 penalty (C=1.0)
- Purpose: Establish interpretable baseline
Candidate 2 (XGBoost):
- Algorithm: Gradient-boosted decision trees
- Hyperparameters:
  - n_estimators: 100
  - max_depth: 6 (shallow trees to prevent overfitting)
  - learning_rate: 0.1
  - subsample: 0.8
  - colsample_bytree: 0.8
- Purpose: Achieve maximum performance while maintaining interpretability
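A hedged sketch of the XGBoost candidate with the hyperparameters listed above; this is the xgb_model wrapped by the calibration snippet below, with all other arguments left at library defaults:

```python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=6,          # shallow trees to prevent overfitting
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
)
xgb_model.fit(X_train, y_train)
```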
Why Calibration Matters:
Raw model outputs (e.g., predict_proba()) are not always well-calibrated—a predicted probability of 0.8 might not correspond to 80% empirical likelihood. For threshold-based decision systems, calibration is critical.
Methodology:
from sklearn.calibration import CalibratedClassifierCV
# Wrap XGBoost with isotonic calibration (5-fold CV)
calibrated_model = CalibratedClassifierCV(
    base_estimator=xgb_model,
    method='isotonic',  # Monotonic calibration
    cv=5                # 5-fold cross-validation
)
Isotonic Regression: Fits a piecewise-constant, monotonically increasing function to map raw scores to calibrated probabilities.
Validation: Brier score of 0.0052 (near-perfect calibration, where 0.0 = perfect, 0.25 = random).
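A short sketch of how that check could be reproduced on the validation fold, continuing from the snippet above:

```python
from sklearn.metrics import brier_score_loss

calibrated_model.fit(X_train, y_train)
p_val = calibrated_model.predict_proba(X_val)[:, 1]           # probability of the positive class
print(f"Brier score: {brier_score_loss(y_val, p_val):.4f}")   # 0.0052 reported on this split
```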
During validation, the 8-feature model (including IsHTTPS) exhibited 100% false negative concentration on HTTPS phishing at its optimal threshold.
Training Data Distribution:
- Phishing URLs: 6% HTTPS, 94% HTTP
- Legitimate URLs: 97% HTTPS, 3% HTTP
Result: Model learned IsHTTPS=1 as a strong legitimacy signal, creating a systematic blind spot for modern HTTPS phishing attacks.
| Model | Threshold | Total FNs | HTTPS FNs | HTTP FNs | Error Pattern |
|---|---|---|---|---|---|
| 7-feature (no IsHTTPS) | 0.50 | 210 | 93 (44%) | 117 (56%) | RANDOM |
| 8-feature (with IsHTTPS) | 0.36 | 101 | 101 (100%) | 0 (0%) | SYSTEMATIC |
We chose the 7-feature model because:
-
Error Distribution > Error Count
- 8-feature: 100% HTTPS concentration = systematic vulnerability
- 7-feature: 44% HTTPS distribution = no blind spot
-
Modern Threat Landscape
- Dataset (2019-2020): 6% HTTPS phishing
- Reality (2025): ~75% HTTPS phishing (Let's Encrypt era)
- 8-feature: 100% miss rate on dominant attack vector
-
Production Safety
- 7-feature: Standard threshold (0.5), predictable behavior
- 8-feature: Non-standard threshold (0.36), prone to misconfiguration
Trade-off Accepted:
- Cost: +109 FNs = +0.54% miss rate
- Benefit: Eliminate 100% HTTPS vulnerability
- Net Result: Safer, more robust production system
HTTPS Breakdown on validation:
- Total FNs: 210
- HTTPS FNs: 93 (44.3%) ← Random distribution ✅
- HTTP FNs: 117 (55.7%)
| Metric | 8-Feature | 7-Feature | Delta |
|---|---|---|---|
| PR-AUC | 0.9992 | 0.9987 | -0.0005 |
| F1-macro | 0.9940 | 0.9940 | 0.0000 |
| HTTPS FN Rate | 100% | 44% | -56% ✅ |
Conclusion: Distribution shift requires feature removal, not just threshold tuning.
Validation Set Metrics (47,074 URLs):
| Metric | Value | Interpretation |
|---|---|---|
| PR-AUC | 99.87% | Near-perfect precision-recall tradeoff |
| F1-Macro | 99.40% | Excellent balance across both classes |
| Brier Score | 0.0052 | Well-calibrated probabilities |
| False Positive Rate | 0.09% | 23 out of 26,970 legitimate URLs misclassified |
| False Negative Rate | 0.12% | 24 out of 20,104 phishing URLs misclassified |
Prediction Confidence Distribution:
- Extreme Phishing (p ≥ 0.998): 36.0% (16,909 samples) of validation set
- Extreme Legitimate (p ≤ 0.011): 52.0% (24,412 samples) of validation set
- Uncertain (0.011 < p < 0.998): Only 12.0% (5,632 samples) of validation set
Interpretation: Model is highly confident in its predictions, with minimal uncertainty.
Saved for Production:
- Model File: models/dev/model_7feat.pkl (XGBoost + isotonic calibration)
- Metadata: models/dev/model_7feat_meta.json (feature order, class mapping, performance metrics)
- Training Notebook: notebooks/02_ablation_url_only.ipynb (source of truth)
Rather than using a single binary threshold, PhishGuardAI implements a three-tier decision framework:
- ALLOW: High-confidence legitimate (auto-allow without review)
- REVIEW: Gray zone (escalate to judge for manual/explainable decision)
- BLOCK: High-confidence phishing (auto-block without review)
Business Rationale:
- Automate clear cases (reduce operational load)
- Flag uncertain cases for review (prevent false positives/negatives)
- Provide explainable rationale for borderline decisions (compliance, trust)
Step 1: Optimal Decision Threshold (t_star)
Using F1-macro optimization on validation data:
- Optimal Threshold: t_star = 0.50
- F1-Macro at t_star: 0.9940
Step 2: Gray-Zone Bands
Define low and high thresholds to create a REVIEW zone:
- Low Threshold: 0.0011 (if p < 0.0011, auto-ALLOW) ---> Allow-rate = 52%
- High Threshold: 0.994 (if p > 0.994, auto-BLOCK) ---> Block-rate = 36%
- Gray Zone: 0.0011 ≤ p < 0.994 → Escalate to REVIEW ---> Review-rate = 12%
- 88% automation (ALLOW + BLOCK without human review)
- 12% gray zone (flagged for review)
Figure 4: Precision-Recall curve with optimal threshold (t_star=0.50) and gray-zone bands
Validation Set Breakdown:
| Decision | Count | Percentage | Notes |
|---|---|---|---|
| ALLOW | 24,412 | 52% | p < 0.0011 (high-confidence legitimate) |
| REVIEW | 5,632 | 12.0% | 0.0011 ≤ p < 0.994 (gray zone, judge escalation) |
| BLOCK | 16,909 | 36.0% | p ≥ 0.994 (high-confidence phishing) |
Key Insight: 88% of decisions are automated with high confidence, while only 12% require manual review or judge intervention.
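As a minimal sketch, the band logic amounts to two comparisons (default values here are the low/high thresholds quoted above; the deployed values are loaded from configs/dev/thresholds.json):

```python
def policy_band_decision(p_malicious: float, low: float = 0.0011, high: float = 0.994) -> str:
    """Map a calibrated probability to ALLOW / REVIEW / BLOCK."""
    if p_malicious < low:
        return "ALLOW"    # high-confidence legitimate, auto-allow
    if p_malicious >= high:
        return "BLOCK"    # high-confidence phishing, auto-block
    return "REVIEW"       # gray zone, escalate to the judge
```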
PhishGuardAI consists of three microservices:
- Gateway Service (:8080 externally, :8000 internally)
  - Whitelist checking (fast-path for known domains)
  - Policy band decisions (threshold-based automation)
  - Enhanced routing logic (short domain handling)
  - Judge orchestration (gray zone escalation)
  - Metrics collection (/stats endpoint)
- Model Service (:8002)
  - Feature extraction (src/common/feature_extraction.py)
  - XGBoost inference (7-feature model)
  - Isotonic calibration (probability adjustment)
  - SHAP explainability (/predict/explain endpoint)
- Judge Service
  - Stub Judge (deterministic rule-based logic)
  - LLM Judge (Ollama adapter with fallback)
  - Explainable rationale generation
The Problem:
During initial deployment, well-known legitimate domains (e.g., google.com, github.com, microsoft.com) were being misclassified as phishing with high confidence (p > 0.95).
Root Cause Analysis:
- Distribution Shift: PhiUSIIL training data (2019-2020) lacks representation of major tech companies
- Short Domain Issue: google.com (10 chars) is significantly shorter than training data average (21.5 chars)
- URLCharProb Outlier: google.com's URLCharProb (1.0) is 4073 standard deviations from training mean (0.06)
- TLD Bias: While .com is common, google.com's simplicity makes it out-of-distribution
Debugging Evidence:
URL: https://google.com
Predicted: p_malicious = 0.98 (WRONG!)
Feature Analysis:
- DomainLength: 10 (training avg: 21.5, std: 12.3) → 0.93 std devs below mean
- URLCharProb: 1.0 (training avg: 0.06, std: 0.0002) → 4073 std devs above mean
- TLDLegitimateProb: 0.612 (legitimate avg: 0.709) → Slightly below average
Conclusion: Model never saw short, simple legitimate domains in training.
The Solution: Intelligent Whitelisting
Added a pre-model whitelist check for known legitimate domains:
KNOWN_LEGITIMATE_DOMAINS = {
"google.com", "www.google.com",
"github.com", "www.github.com",
"microsoft.com", "www.microsoft.com",
"amazon.com", "apple.com", "facebook.com",
"twitter.com", "linkedin.com", "youtube.com",
"wikipedia.org", "stackoverflow.com",
"netflix.com", "paypal.com"
}
Whitelist Logic:
- Fast-path: O(1) set lookup, <10ms response
- Override: Returns p_malicious = 0.01 (bypasses model entirely)
- Reason: "domain-whitelist" (inherently explainable)
When to Use Whitelist:
- Out-of-distribution domains (major tech companies)
- Edge cases where model has known blind spots
- High-value domains where false positives are unacceptable
The Problem:
Even with the whitelist, other short legitimate domains (e.g., npm.org, bit.ly, t.co) were still being flagged as suspicious.
Statistical Insight:
- Training data: Average domain length = 21.5 chars
- Short domains: ≤10 chars are rare in training (only 5.3% of legitimate URLs)
- Model bias: Learned to associate short domains with phishing (many short phishing domains in training)
The Solution: Enhanced Routing Logic
Implemented intelligent routing for short domain edge cases:
from urllib.parse import urlparse

def _should_route_to_judge_for_short_domain(url: str, p_malicious: float) -> bool:
    """
    Route short domains to judge even if model is moderately confident.
    Rationale: Short legitimate domains (npm.org, bit.ly) are underrepresented
    in training data and may appear suspicious to the model.
    """
    domain = urlparse(url).netloc.lower()
    domain_no_www = domain.replace("www.", "")
    return (
        len(domain_no_www) <= 12 and  # Short domain
        p_malicious < 0.5             # Moderate confidence (not extreme)
    )
Decision Flow:
- Standard Gray Zone: 0.0011 ≤ p < 0.994 → Judge with standard context
- Short Domain Gray Zone: len(domain) ≤ 12 AND p < 0.5 → Judge with short domain context
Example:
URL: https://npm.org
Model prediction: p_malicious = 0.35 (moderate)
Standard routing: REVIEW (gray zone)
Enhanced routing: SHORT_DOMAIN_REVIEW (judge with context)
Judge verdict: LEAN_LEGIT (no obvious phishing heuristics)
Rationale: "Short domain, but no suspicious patterns. Likely legitimate URL shortener or package registry."
Final decision: ALLOW
Benefits:
- Prevents false positives on legitimate shorteners (bit.ly, t.co, npm.org)
- Provides explainable rationale for security teams
- Gracefully handles edge cases without retraining model
Stub Judge (Default)
- Type: Deterministic rule-based heuristics
- Logic:
- Pattern matching: Does URL contain "login", "verify", "account"?
- Feature thresholds: Is TLD legitimacy < 0.2? Is char continuation > 0.5?
- Domain analysis: Is domain unusually long or short?
- Advantages:
- Fast (<10ms)
- Deterministic (reproducible)
- No dependencies (never fails)
- Explainable (clear rule-based rationale)
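A hedged sketch of heuristics in this spirit (the real rules live in src/judge_svc/stub.py; the domain-length cutoff and anything beyond the quoted keywords and thresholds are illustrative):

```python
SUSPICIOUS_KEYWORDS = ("login", "verify", "account")

def stub_judge(url: str, features: dict) -> tuple[str, str]:
    """Deterministic rule-based verdict: returns (verdict, rationale)."""
    url_lower = url.lower()
    if any(k in url_lower for k in SUSPICIOUS_KEYWORDS):
        return "LEAN_PHISH", "URL contains a credential-harvesting keyword"
    if features.get("TLDLegitimateProb", 1.0) < 0.2:
        return "LEAN_PHISH", "TLD has a very low legitimacy probability"
    if features.get("CharContinuationRate", 0.0) > 0.5:
        return "LEAN_PHISH", "Unusually high character repetition"
    if features.get("DomainLength", 0) > 40:   # illustrative "unusually long" cutoff
        return "LEAN_PHISH", "Domain is unusually long"
    return "LEAN_LEGIT", "No suspicious patterns detected"
```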
LLM Judge (Optional)
- Type: Ollama-based LLM (llama3.2:1b)
- Logic: Natural language reasoning with URL context
- Advantages:
- Richer explanations
- Better edge case handling
- Human-readable rationale
- Fallback: Automatically falls back to stub judge if Ollama unavailable
Judge Verdicts:
- LEAN_LEGIT: Likely legitimate → Final decision: ALLOW
- LEAN_PHISH: Likely phishing → Final decision: BLOCK
- UNCERTAIN: Unclear → Final decision: REVIEW (manual review queue)
# Check if Ollama service is running
ollama list
# Should show: llama3.2:1b
# Test Ollama directly
ollama run llama3.2:1b
# Try: "Is google.com a phishing URL?"
# If it responds, Ollama is working!
# Type /bye to exit
# Check Ollama API endpoint
curl http://localhost:11434/api/tags
# Should return JSON with model list
- If Ollama isn't running:
# Start Ollama service
ollama serve
# Keep this running in a terminal
# .env file
# Enable LLM judge (instead of stub)
JUDGE_BACKEND=llm
# Ollama API endpoint (default is localhost:11434)
OLLAMA_BASE_URL=http://localhost:11434
# Model name (must match your installed model)
OLLAMA_MODEL=llama3.2:1b
# Judge timeout (seconds)
JUDGE_TIMEOUT=60
# Verbose logging (to see judge calls)
LOG_LEVEL=DEBUG
- Windows:
set JUDGE_BACKEND=llm
set OLLAMA_BASE_URL=http://localhost:11434
set OLLAMA_MODEL=llama3.2:1b
set LOG_LEVEL=DEBUG
- Linux/Mac:
export JUDGE_BACKEND=llm
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_MODEL=llama3.2:1b
export LOG_LEVEL=DEBUG
Purpose: Provide feature-level explanations for model predictions to support:
- Regulatory compliance (GDPR right to explanation)
- Security team trust (understand why URL was flagged)
- Model debugging (identify biases or unexpected patterns)
- Merchant appeals (provide evidence for overturning decisions)
Implementation:
- Endpoint: /predict/explain (Model Service)
- Method: TreeExplainer (fast, exact SHAP values for tree models)
- Fallback: KernelExplainer (model-agnostic, slower but more general)
- Output: JSON with SHAP values for each feature
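A minimal sketch of how these attributions could be computed with the shap library, assuming access to the fitted XGBoost model and the extracted feature row (the service's actual wiring may differ):

```python
import shap
import pandas as pd

explainer = shap.TreeExplainer(xgb_model)        # exact, fast SHAP for tree ensembles

def explain_url(feature_row: dict) -> dict:
    """Return per-feature SHAP contributions for one URL's extracted features."""
    X = pd.DataFrame([feature_row])
    shap_values = explainer.shap_values(X)       # shape: (1, n_features)
    return dict(zip(X.columns, shap_values[0].tolist()))
```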
SHAP Dashboard:
- Access: http://localhost:8000/explain (served by Gateway)
- Features:
- Input URL field
- Prediction display (probability + risk level)
- Feature contribution bar chart (red = increases risk, green = decreases risk)
- Feature values table (raw extractions)
Example SHAP Output:
{
"url": "http://facebook1mob.com",
"p_malicious": 1.0,
"source": "model",
"feature_contributions": {
"NoOfOtherSpecialCharsInURL": 1.7578, // Typosquatting ('1' in domain)
"DomainLength": -1.8956, // Moderate length slightly protective
"CharContinuationRate": 1.0135,
"TLDLegitimateProb": 0.0271,
"URLCharProb": 0.0000,
"SpacialCharRatioInURL": -0.1397,
"LetterRatioInURL": -0.1065
},
"feature_values": {
"DomainLength": 20.0,
"NoOfOtherSpecialCharsInURL": 5.0,
"CharContinuationRate": 0.1923,
"TLDLegitimateProb": 0.6120,
"URLCharProb": 1.0,
"SpacialCharRatioInURL": 0.1852,
"LetterRatioInURL": 0.7778
}
}
Interpretation:
- NoOfOtherSpecialCharsInURL = 5 (the '1' in "facebook1mob") has SHAP value of +1.76 → Indicates typosquatting
- DomainLength = 20 has SHAP value of -1.90 → Moderate length is slightly protective
Two-Tier Explainability:
- Whitelist decisions: Inherently explainable ("It's on the known legitimate domain list")
- Model decisions: SHAP feature attributions (e.g., "NoOfOtherSpecialCharsInURL increased risk by +1.76")
Performance Note: SHAP computation adds latency (~50-100ms for TreeExplainer). Current implementation includes SHAP in the model service, resulting in a documented performance bottleneck. Future optimization work includes:
- Lazy SHAP computation (only when the /explain endpoint is called)
- SHAP caching for repeated URLs
- Asynchronous SHAP computation for non-critical paths
For detailed SHAP usage and interpretation, see EXPLAINABILITY.md.
Purpose: Validate that feature extraction produces valid, model-ready data.
What Gets Validated:
- Input: Processed feature CSV (phiusiil_features_v2.csv) with 7 engineered features
- NOT validated: Raw PhiUSIIL CSV (no expectations on raw data)
Validation Checks:
- Column Presence: All 7 features must exist (IsHTTPS, TLDLegitimateProb, CharContinuationRate, etc.)
- Data Types:
  - Probability features (TLDLegitimateProb, URLCharProb, etc.): float64 in range [0.0, 1.0]
  - Count features (NoOfOtherSpecialCharsInURL): int64 or integer-like floats
  - Length features (DomainLength): int64 with range [1, 253] (RFC 1035)
- Probabilities: 0.0 ≤ value ≤ 1.0
- Domain Length: 1 ≤ value ≤ 253 (RFC 1035 DNS limit)
- Special Char Count: 0 ≤ value ≤ 1000 (reasonable upper bound)
- Ratios: 0.0 ≤ value ≤ 1.0
- TLDLegitimateProb: Mean should be in range [0.4, 0.8] (balanced TLD mix)
- CharContinuationRate: Mean should be in range [0.0, 0.3] (low repetition)
- No extreme outliers: Values within 5 standard deviations of mean
- No null values: Zero missing data
- Unique URLs: No duplicate URLs in dataset
- Label validity: Label column contains only 0 (phishing) or 1 (legitimate)
CI/CD Integration:
Great Expectations runs automatically via GitHub Actions:
- Workflow: .github/workflows/data-contract.yml
- Triggers:
  - All pull requests
  - Pushes to dev and main branches
- Action: Runs scripts/ge_check.py, which validates the feature dataset
- Result: Pipeline fails if any expectations are violated
Benefits:
- Prevents bad data: Catches feature extraction bugs before training
- Ensures consistency: Validates feature distributions match expected ranges
- Documents assumptions: Expectations serve as living documentation
- CI/CD guardrails: Automated checks prevent invalid data from merging
Example Expectation Suite:
# Sample expectations from ge_build_phiusiil_suite.py
suite.expect_column_values_to_be_between("TLDLegitimateProb", min_value=0.0, max_value=1.0)
suite.expect_column_values_to_be_between("DomainLength", min_value=1, max_value=253)
suite.expect_column_mean_to_be_between("IsHTTPS", min_value=0.6, max_value=0.95)
suite.expect_column_to_exist("CharContinuationRate")
Purpose: Package the entire system for reproducible deployment.
Multi-Stage Build:
# gateway.Dockerfile
# ---- build stage: install runtime deps into a venv ----
FROM python:3.11-slim AS builder
WORKDIR /app
ENV PIP_NO_CACHE_DIR=1
# minimal build tools only in builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
&& rm -rf /var/lib/apt/lists/*
# copy metadata & install the project (runtime deps only; no [dev])
COPY pyproject.toml Readme.md ./
# Rename Readme.md to README.md for pyproject.toml compatibility
RUN mv Readme.md README.md
# Copy source code needed for installation
COPY src ./src
RUN python -m venv /opt/venv \
&& /opt/venv/bin/pip install --upgrade pip \
&& /opt/venv/bin/pip install .
# ---- runtime stage: tiny final image ----
FROM python:3.11-slim
ENV PATH="/opt/venv/bin:$PATH" \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
WORKDIR /app
# copy installed env and only the code/configs needed at runtime
COPY --from=builder /opt/venv /opt/venv
COPY src ./src
COPY configs ./configs
COPY data/tld_probs.json ./data/tld_probs.json
EXPOSE 8000
CMD ["uvicorn", "gateway.main:app", "--host", "0.0.0.0", "--port", "8000"]
Benefits:
- Reproducibility: Same environment everywhere (dev, staging, prod)
- Portability: Runs on any Docker-compatible platform
- Isolation: No dependency conflicts with host system
- Efficiency: Multi-stage build reduces image size
Deployment:
# Build gateway service
docker build -f docker/gateway.Dockerfile -t phishguard-gateway:latest .
# Build model service
docker build -f docker/model.Dockerfile -t phishguard-model:latest .
# Run model service first
docker run -d --name phishguard-model -p 8002:8002 phishguard-model:latest
# Run gateway service
docker run --rm -p 8080:8000 \
-e MODEL_SVC_URL=http://host.docker.internal:8002 \
-e THRESHOLDS_JSON=configs/dev/thresholds.json \
-e JUDGE_BACKEND=stub \
  phishguard-gateway:latest
Three Layers of Testing:
- Scope: Individual functions (feature extraction, threshold logic, judge verdicts)
- Coverage: Feature extraction functions, policy band calculations, SHAP wrappers
- Framework: pytest
- Scope: Service interactions (Gateway → Model Service → Judge)
- Coverage: End-to-end request flows, fallback mechanisms, error handling
- Example: tests/test_enhanced_routing.py validates short domain routing logic
- Scope: Complete system (all services running)
- Coverage: Real URLs tested against live services
- Manual Test Cases:
  - Whitelist: curl http://localhost:8000/predict -d '{"url":"https://github.com"}'
  - Phishing: curl http://localhost:8000/predict -d '{"url":"https://phishing.top"}'
  - Short domain: curl http://localhost:8000/predict -d '{"url":"https://npm.org"}'
Test Scenarios:
- ✅ High confidence cases (no judge invocation)
- ✅ Gray zone standard routing
- ✅ Short domain edge cases
- ✅ Boundary conditions (len=10, p=0.5)
- ✅ Error handling (malformed URLs, missing features)
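A small pytest sketch for the short-domain routing helper shown earlier (the import path is illustrative):

```python
import pytest
from src.gateway.main import _should_route_to_judge_for_short_domain  # illustrative import path

@pytest.mark.parametrize(
    "url, p_malicious, expected",
    [
        ("https://npm.org", 0.35, True),                    # short domain, moderate score -> judge
        ("https://phishing.top", 0.99, False),              # confident score -> no override
        ("https://averylongdomainname.com", 0.35, False),   # not a short domain
    ],
)
def test_short_domain_routing(url, p_malicious, expected):
    assert _should_route_to_judge_for_short_domain(url, p_malicious) == expected
```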
Lesson: Even a near-perfect model (99.92% PR-AUC) can fail on out-of-distribution data.
What Happened:
- Model trained on PhiUSIIL (2019-2020) never saw major tech companies
- google.com and github.com were misclassified as phishing due to short domains + simple character distributions
- Root Cause: Training data focused on obscure legitimate sites and phishing URLs, not major brands
Solution:
- Whitelist strategy for known legitimate domains (industry standard)
- Enhanced routing logic for short domain edge cases (domain knowledge encoded in decision flow)
- Not a model failure: This is expected behavior for out-of-distribution inputs
Business Implication:
- Machine learning models are pattern recognizers, not oracles
- Domain knowledge and business logic are essential complements to ML
- Production systems need layered defenses (whitelist + model + judge)
Lesson: Feature extraction bugs are silent killers of ML systems.
What Happened:
- Initial deployment: example.com predicted as 100% phishing (clearly wrong)
- Root cause: Training used PhiUSIIL's pre-computed features; production used custom extraction logic
- Small implementation differences (how to count special chars, calculate ratios) led to vastly different feature values
Solution:
- Created shared feature extraction library (src/common/feature_extraction.py)
- Rewrote training pipeline to use shared library (no more black-box PhiUSIIL features)
- Validated: Batch extraction (training) matches live extraction (serving) for 1000 random URLs
Key Takeaway:
- Feature extraction is code, not notebooks
- Always use shared libraries to ensure consistency
- Test end-to-end: Same URL should produce same features in training and serving
Lesson: Hand-wavy hyperparameters lead to overconfident predictions.
What Happened:
- Initial TLD probabilities had 911 TLDs with extreme values (0.0 or 1.0)
- 70% of TLDs had < 10 samples (unreliable estimates)
- Model overconfident on rare TLDs
Solution:
- Conducted rigorous Wilson confidence interval analysis
- Set MIN_SAMPLES = 20 based on statistical justification (CI width ≤ 0.3)
- Used Bayesian priors (α=1, β=2) for security-first bias
- Result: Zero TLDs with extreme probabilities
Business Justification:
- Better to classify rare TLDs as "risky" than miss phishing (false negative is worse than false positive)
- Security-first approach: "Unknown TLDs are risky until proven safe"
Lesson: Black-box models are unacceptable for production fraud systems.
Why Explainability Matters:
- Regulatory compliance: GDPR right to explanation, PCI-DSS audit trails
- Security team trust: Teams won't use models they don't understand
- Debugging: SHAP reveals unexpected model behaviors (e.g., IsHTTPS sometimes increases risk)
- Merchant appeals: Users deserve to know why their URL was blocked
Implementation:
- SHAP dashboard provides feature-level attributions
- Judge rationale gives human-readable explanations for gray-zone decisions
- Two-tier explainability: Whitelist (inherently explainable) + SHAP (model decisions)
Performance Trade-Off:
- SHAP computation adds ~200-500ms latency to the /explain endpoint
- Acceptable for on-demand explanations; not suitable for real-time scanning
Lesson: A 99.87% PR-AUC model is useless if it crashes, has unpredictable latency, or produces unexplainable decisions.
What Production ML Requires:
- ✅ Graceful degradation: Fallback mechanisms (stub judge, heuristic scoring)
- ✅ Fail-secure design: Never return 500 errors (judge falls back to stub, model falls back to heuristic)
- ✅ Observability: Metrics (/stats), health checks (/health), structured logging
- ✅ CI/CD: Automated testing and data validation
- ✅ Explainability: SHAP + judge rationale for compliance
Operational Maturity > Perfect Model:
- 99.87% PR-AUC with 2s latency → Not production-ready
- 99.87% PR-AUC with 50ms latency + observability → Production-ready
- Python 3.11+
- pip
# Clone repository
git clone https://github.com/yourusername/PhishGuardAI.git
cd PhishGuardAI
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]"Option 1: Using Docker Compose (Recommended)
# Start all services
docker compose up -d
# Check service health
docker ps
# View logs
docker compose logs -f gateway
docker compose logs -f model-svc
Option 2: Running Locally (Development)
Terminal 1: Model Service
python -m src.model_svc.main
# Wait for: ✓ Model Service Ready
# Listening on http://localhost:8002
Terminal 2: Gateway Service
# Windows
set MODEL_SVC_URL=http://localhost:8002
# Linux/Mac
export MODEL_SVC_URL=http://localhost:8002
python -m src.gateway.main
# Listening on http://localhost:8000
When using Docker (recommended):
1. Whitelist Test
curl -X POST http://localhost:8080/predict ^
-H "Content-Type: application/json" ^
-d "{\"url\":\"https://github.com\"}"
# → {"decision":"ALLOW","reason":"domain-whitelist","source":"whitelist"}2. Phishing Detection
curl -X POST http://localhost:8080/predict ^
-H "Content-Type: application/json" ^
-d "{\"url\":\"https://phishing.top\"}"
# → {"p_malicious":1.0,"decision":"BLOCK","reason":"policy-band"}3. Short Domain Routing
curl -X POST http://localhost:8080/predict ^
-H "Content-Type: application/json" ^
-d "{\"url\":\"https://npm.org\"}"
# → {"decision":"ALLOW","reason":"judge-short-domain-lean-legit"}4. SHAP Dashboard
Open browser: http://localhost:8080/explain
5. Stats Monitoring
curl http://localhost:8080/stats
# → {"policy_decisions":{...},"final_decisions":{...},"judge_verdicts":{...}}# Build gateway image
docker build -f docker/gateway.Dockerfile -t phishguard-gateway:latest .
# Build model service image
docker build -f docker/model.Dockerfile -t phishguard-model:latest .
# Run model service first
docker run -d --name phishguard-model -p 8002:8002 phishguard-model:latest
# Run gateway service
docker run --rm -p 8080:8000 \
-e MODEL_SVC_URL=http://host.docker.internal:8002 \
-e THRESHOLDS_JSON=configs/dev/thresholds.json \
-e JUDGE_BACKEND=stub \
  phishguard-gateway:latest
# Build and run all services
docker compose up -d
# Or build and run specific service
docker compose up gateway -d
PhishGuardAI/
├─ notebooks/ # Complete ML lifecycle
│ ├─ 00_eda.ipynb # Exploratory data analysis (feature selection)
│ ├─ feature_engineering.ipynb # Feature extraction (FIXED training/serving skew)
│ ├─ 01_baseline_and_calibration.ipynb # Baseline models + isotonic calibration
│ └─ 02_ablation_url_only.ipynb # ⭐ SOURCE OF TRUTH (final 8-feature model)
├─ src/
│ ├─ gateway/ # FastAPI gateway (policy bands, routing, stats)
│ │ ├─ main.py
│ │ └─ judge_wire.py
│ ├─ model_svc/ # FastAPI model service (XGBoost + SHAP)
│ │ └─ main.py # Includes /predict and /predict/explain
│ ├─ judge_svc/ # Judge decision logic
│ │ ├─ stub.py # Deterministic rule-based judge
│ │ └─ adapter.py # LLM adapter (Ollama) with stub fallback
│ └─ common/ # Shared libraries (CRITICAL for consistency)
│ ├─ feature_extraction.py # ⭐ Shared feature library (training + serving)
│ ├─ tld_probs.json # TLD Bayesian priors (MIN_SAMPLES=10)
│ ├─ thresholds.py # Threshold loading utilities
│ ├─ stats.py # In-process metrics collection
│ └─ audit.py # Optional MongoDB audit logging
├─ models/dev/
│ ├─ model_8feat.pkl # Production model (8 features, isotonic calibration)
│ └─ model_8feat_meta.json # Metadata (feature order, class mapping, metrics)
├─ configs/dev/
│ ├─ thresholds.json # Policy band thresholds (low=0.004, high=0.999)
│ └─ config.yaml # Model service configuration
├─ scripts/
│ ├─ ge_check.py # Great Expectations validation (CI/CD)
│ └─ ge_build_phiusiil_suite.py # Build GE expectation suite
├─ static/ # SHAP dashboard HTML/CSS/JS
│ └─ explain.html
├─ tests/
│ └─ test_enhanced_routing.py # Enhanced routing tests
├─ docker/
│ └─ gateway.Dockerfile # Multi-stage Docker build
├─ .github/workflows/
│ ├─ ci.yml # Tests + Docker build
│ └─ data-contract.yml # Great Expectations CI
├─ outputs/eda/ # EDA artifacts (visualizations, CSVs)
│ ├─ all_url_only_features_distribution.png
│ ├─ url_features_correlation_heatmap.png
│ ├─ feature_separation_scores.csv
│ └─ eda_summary.json
├─ docs/ # Comprehensive documentation
│ ├─ INTERVIEW_PREP.md # Interview cheat sheet
│ ├─ EXPLAINABILITY.md # SHAP dashboard guide
│ ├─ MODEL_CARD.md # Model documentation
│ └─ DEMO_SCRIPT.md # 5-minute walkthrough
├─ README.md # This file
└─ .env.example # Environment variables template
- EXPLAINABILITY.md - SHAP dashboard usage and interpretation
- MODEL_CARD.md - Industry-standard model documentation (Google/HuggingFace format)
- API.md - Complete endpoint reference, Request/response schemas, Code examples (curl, Python, Bash), Example workflows
- ARCHITECTURE.md - IsHTTPS distribution shift deep dive, Feature engineering philosophy, Policy bands & gray zone design, LLM judge integration rationale
- DEPLOYMENT.md - Local development setup, Docker deployment, Ollama LLM judge installation
- JUDGE.md - LLM judge architecture, Routing logic explained, Prompt engineering details, Performance characteristics, LLM vs Stub comparison
- URL-Only Scope
  - Doesn't analyze page content (HTML, images, JavaScript)
  - Mitigation: Add optional page content features for high-risk cases
- Static Whitelist
  - Requires manual updates for new domains
  - Mitigation: Automate with Alexa Top 1000 API or Cloudflare Radar domain reputation
- No Real-Time Drift Detection
  - Can't detect feature distribution shifts in production
  - Mitigation: Implement PSI (Population Stability Index) monitoring + retraining triggers (see the PSI sketch after this list)
- Performance Bottleneck (Documented)
  - SHAP explainability adds ~200-500ms latency to the /predict/explain endpoint
  - Investigation Underway: Profiling model service to identify optimization opportunities
  - Candidate Solutions:
    - Lazy SHAP computation (only when the /explain endpoint is called, not on every prediction)
    - SHAP caching layer (cache explanations for frequently-seen URLs)
    - Asynchronous SHAP computation (return prediction immediately, compute SHAP in background)
    - Model optimization (quantization, smaller SHAP background dataset)
  - Status: Performance profiling planned; initial optimization work scheduled for next sprint
  - Acceptable Trade-Off: Current latency is acceptable for on-demand explanations (not real-time scanning)
- Static Thresholds
  - Policy bands don't adapt to fraud rate changes
  - Mitigation: Implement dynamic threshold tuning based on operational capacity
- LLM First-Call Latency
  - 15-20s model loading on the first call
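As a hedged sketch of the drift check mentioned in the mitigation above, this is one standard way to compute PSI for a single feature (the bin count and the 0.1/0.2 rule of thumb are common conventions, not project settings):

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a training-time feature sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)   # avoid division by zero / log(0)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate / retrain.
```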
Short-term (1-2 weeks):
- Performance profiling and optimization (address SHAP latency)
- Prometheus metrics + Grafana dashboards
- Structured logging with request IDs
Medium-term (1-3 months):
- Feature drift detection (PSI monitoring + alerts)
- Automated retraining pipeline (weekly Airflow job with last 6 months of data)
- Security hardening (rate limiting, JWT authentication, API keys)
- A/B testing framework (shadow mode for new models)
Long-term (3-6 months):
- Page content features (HTML analysis, screenshot detection)
- Active learning pipeline (identify uncertain predictions, request labels)
- Champion/challenger framework (automatic model promotion if metrics improve)
- Multi-model ensemble (URL + page content + user behavior signals)
PhiUSIIL Phishing URL Dataset
- Authors: Prasad, A., & Chandra, S. (2023)
- Title: PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning
- Journal: Computers & Security, 103545
- DOI: 10.1016/j.cose.2023.103545
- Size: 235,370 unique URLs (after deduplication)
- Collection Period: 2019-2020
- Known Gap: Major tech companies (Google, GitHub, Microsoft) excluded from training data
# All tests
pytest -v
# Data contract validation
python scripts/ge_check.py
# CI locally
pre-commit run --all-files
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
What This Project Demonstrates:
✅ Complete ML Lifecycle: EDA → Feature Engineering → Model Development → Threshold Optimization → Production Deployment
✅ Statistical Rigor: Justified hyperparameters (MIN_SAMPLES=20, Bayesian priors) with confidence interval analysis
✅ Production ML Ops: Training/serving consistency, graceful degradation, observability, data contracts
✅ System Design Thinking: Multi-tier decisions, enhanced routing, whitelist strategy
✅ Explainability: SHAP dashboard + judge rationale for regulatory compliance
✅ Operational Maturity: Great Expectations in CI/CD, Docker containerization, comprehensive testing
Most Valuable Lessons:
- Training/serving skew is real - Feature extraction must use shared libraries
- Distribution shift happens - Whitelists and enhanced routing handle edge cases
- Statistical rigor matters - Data-driven hyperparameter justification prevents overconfidence
- Explainability is essential - SHAP + judge rationale build trust and enable compliance
- Production ML is about reliability - Graceful degradation, observability, and data validation matter as much as accuracy
Fitsum Gebrezghiabihier
Email: [email protected]
LinkedIn: linkedin.com/in/fitsumblb
GitHub: github.com/fitsblb
Project Link: https://github.com/fitsblb/PhishGuardAI
- PhiUSIIL dataset authors (Prasad & Chandra) for high-quality labeled data
- FastAPI for exceptional API framework
- SHAP for interpretable ML explanations
- Great Expectations for data quality validation
- scikit-learn and XGBoost communities
Built with ❤️ to demonstrate end-to-end ML Ops capabilities from discovery to production deployment.