A multi-layered open-science dataset for phishing, spam, and legitimate email analysis using emotional, motivational, and semantic labels.
This repository contains a new, richly annotated dataset designed for research on LLM-based email security, including phishing detection, spam analysis, emotional manipulation, and automated robustness evaluation under paraphrasing.
The dataset includes:
- Human-written phishing, spam, and legitimate emails
- LLM-generated emails (GPT-4o, DeepSeek-Chat, Grok, Llama 3.3, Gemini, Nova, Mistral, etc.)
- Emotion and motivation labels
- Rephrased/paraphrased variants from three independent LLM pipelines
- Claude 3.5 Sonnet classifications
This repository enables reproducible research on how LLMs interpret, classify, and analyze deceptive online communication.
merged_emails_with_categories.jsonl
Each record contains the following fields (a loading sketch follows this list):
- True category (Phishing, Spam, Valid)
- Human vs. LLM-generated origin
- Rephrasing source (GPT-4o, DeepSeek, RandomAPI, Manual)
- Emotional labels (urgency, fear, authority, etc.)
- Motivational labels (link-click, credential theft, etc.)
- Claude 3.5 Sonnet predicted classification
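A minimal loading sketch; the field names (`true_category`, `rephrasing_source`, `claude_prediction`) are illustrative assumptions and may differ from the actual JSONL keys:

```python
import json
from collections import Counter

# Load the dataset line by line (one JSON object per email).
records = []
with open("merged_emails_with_categories.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# Quick sanity checks; the key names below are assumed for illustration.
print("Total emails:", len(records))
print("By true category:", Counter(r.get("true_category") for r in records))
print("By rephrasing source:", Counter(r.get("rephrasing_source") for r in records))
```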
- accuracy_validation.py: benchmark for emotional & motivational detection
- category.py: email classification pipeline
- stats.py: strict/relaxed accuracy, confusion matrices, paraphrase robustness
Together, these scripts produce (a metrics sketch follows the list):
- Confusion matrices
- Strict and relaxed classification reports
- Emotional/motivational LLM benchmarking
- Robustness metrics across rephrasing pipelines
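A hedged sketch of how the strict three-class report and confusion matrix could be reproduced with scikit-learn, reusing the `records` list and the assumed field names from the loading sketch above (stats.py may compute these differently):

```python
from sklearn.metrics import classification_report, confusion_matrix

LABELS = ["Phishing", "Spam", "Valid"]

# Assumed field names; adjust to the actual JSONL schema.
y_true = [r["true_category"] for r in records]
y_pred = [r["claude_prediction"] for r in records]

print(classification_report(y_true, y_pred, labels=LABELS, digits=2))
print(confusion_matrix(y_true, y_pred, labels=LABELS))
```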
Phishing and spam emails remain pervasive cybersecurity threats, increasingly strengthened by the use of Large Language Models (LLMs) to generate deceptive content. This work introduces a comprehensive, multi-layered email dataset containing both human-written and LLM-generated messages across phishing, spam, and legitimate categories. Each email is enriched with emotional and motivational labels—capturing cues such as urgency, fear, authority, greed, and link-click incentives—along with paraphrased variants generated by multiple LLM pipelines to test classifier robustness.
We benchmark several modern LLMs for emotional and motivational detection and identify Claude 3.5 Sonnet as the most reliable model for large-scale annotation. We further evaluate its classification accuracy under both strict (three-class) and relaxed (unwanted vs. valid) settings across original and LLM-rewritten emails. Results show that contemporary LLMs can reliably detect harmful messages and emotional manipulation strategies, though distinguishing spam from legitimate emails remains difficult.
All templates, datasets, and source code are released openly to support reproducible research in AI-assisted email security.
- Human-written emails collected from open-source corpora and curated phishing repositories
- LLM-generated emails created to increase stylistic and topical diversity
- Rephrasing via three pipelines (a paraphrasing sketch follows this list):
- DeepSeek-Chat
- GPT-4o
- OpenRouter multi-model pipeline (Gemini, Nova, Grok, Llama, Mistral, etc.)
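A minimal paraphrasing sketch using the OpenAI Python client with GPT-4o; the prompt wording and parameters are assumptions, not the exact pipeline used for this dataset:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def rephrase(email_body: str, model: str = "gpt-4o") -> str:
    """Rewrite an email while preserving its meaning (illustrative prompt)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Paraphrase the following email. Keep its meaning, links, and intent unchanged."},
            {"role": "user", "content": email_body},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```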
Four LLMs evaluated:
- GPT-4o-mini
- GPT-4.1-mini
- Claude 3.5 Sonnet
- DeepSeek-Chat
Evaluation metrics (illustrated in the sketch after this list):
- Strict accuracy
- Close-enough accuracy
- Jaccard similarity
- Internal consistency across 5 independent runs
- Precision & recall
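A sketch of the set-based metrics; here "close-enough" is read as sharing at least one label with the human annotation, which is an assumption about the exact criterion used:

```python
def jaccard(predicted: set[str], gold: set[str]) -> float:
    """Jaccard similarity between predicted and human-annotated label sets."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

def close_enough(predicted: set[str], gold: set[str]) -> bool:
    """Assumed relaxation: correct if the prediction shares any label with the gold set."""
    return bool(predicted & gold)

# Example with emotional labels for a single email.
print(jaccard({"urgency", "fear"}, {"urgency", "authority"}))       # 0.33
print(close_enough({"urgency", "fear"}, {"urgency", "authority"}))  # True
```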
Claude 3.5 Sonnet was selected for full-dataset labeling because it showed the highest agreement with human annotations.
Claude 3.5 Sonnet performed the final classification using the following inputs (prompt sketch below):
- Email body
- Subject line
- Sender metadata
- URL and attachment indicators
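A hedged sketch of what such a call could look like with the Anthropic Python SDK; the prompt, model string, and inputs are assumptions rather than the exact prompt used in category.py:

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def classify_email(subject: str, sender: str, body: str,
                   has_url: bool, has_attachment: bool) -> str:
    """Ask Claude for a one-word verdict: Phishing, Spam, or Valid."""
    prompt = (
        "Classify the following email as Phishing, Spam, or Valid. "
        "Answer with one word only.\n\n"
        f"Subject: {subject}\nSender: {sender}\n"
        f"Contains URL: {has_url}\nContains attachment: {has_attachment}\n\n"
        f"{body}"
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model version string
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text.strip()
```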
Evaluated using the following settings (relaxed-mapping sketch below):
- Strict classification (Phishing / Spam / Valid)
- Relaxed classification (Unwanted vs. Valid)
- Robustness to paraphrasing across three pipelines
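The relaxed setting collapses Phishing and Spam into a single Unwanted class; a small sketch of that mapping, reusing the `y_true`/`y_pred` lists and assumed field names from above:

```python
from sklearn.metrics import accuracy_score

def relax(label: str) -> str:
    """Map the three strict classes onto the binary Unwanted/Valid split."""
    return "Valid" if label == "Valid" else "Unwanted"

strict_acc = accuracy_score(y_true, y_pred)
relaxed_acc = accuracy_score([relax(y) for y in y_true],
                             [relax(y) for y in y_pred])
print(f"strict={strict_acc:.3f}  relaxed={relaxed_acc:.3f}")
```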
- Claude 3.5 Sonnet:
  - Jaccard similarity = 0.60
  - Close-enough accuracy = 42%
- Motivational detection is harder, but top models achieve 53–61% close-enough accuracy
- LLMs often infer additional plausible motivations beyond human annotations
Across all email groups (Original, DeepSeek-rephrased, GPT-4o-rephrased, RandomAPI):
- Strict accuracy: ~66–67%
- Relaxed accuracy: ~69–70%
- Phishing detection excellent (F1 ≈ 0.93)
- Spam detection weak (F1 ≈ 0.20–0.23)
- Valid classification moderate (F1 ≈ 0.63)
Maximum deviation from the original group (computed as in the sketch below):
- Strict accuracy deviation: 0.55 percentage points
- Relaxed accuracy deviation: 0.54 percentage points
Rephrasing has minimal impact on classifier performance.
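A sketch of how the per-group deviation (in percentage points) could be computed, assuming a `rephrasing_source` field and an `Original` marker for non-rephrased emails; both names are illustrative:

```python
from collections import defaultdict

# Accumulate per-group accuracy of Claude's predictions; field names are assumed.
correct, total = defaultdict(int), defaultdict(int)
for r in records:
    group = r.get("rephrasing_source", "Original")
    total[group] += 1
    correct[group] += int(r["claude_prediction"] == r["true_category"])

acc = {g: 100.0 * correct[g] / total[g] for g in total}
baseline = acc.get("Original", 0.0)  # assumed marker for the non-rephrased group
for g, a in sorted(acc.items()):
    print(f"{g}: {a:.2f}%  deviation = {abs(a - baseline):.2f} pp")
```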
Running stats.py produces (a LaTeX-export sketch follows the list):
- Strict and relaxed accuracy
- Confusion matrices
- Group-by-group metrics
- Paraphrasing robustness analysis
- LaTeX-ready tables for publications
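A minimal sketch of exporting group-level metrics as a LaTeX table with pandas; the numbers are placeholders within the ranges reported above, and stats.py may format its tables differently:

```python
import pandas as pd

# Placeholder values chosen to lie in the reported ~66-67% / ~69-70% ranges.
table = pd.DataFrame(
    {
        "Strict accuracy (%)": [66.5, 66.9, 66.1, 67.0],
        "Relaxed accuracy (%)": [69.4, 69.8, 69.1, 69.9],
    },
    index=["Original", "DeepSeek", "GPT-4o", "RandomAPI"],
)
print(table.to_latex(float_format="%.1f",
                     caption="Accuracy per rephrasing group",
                     label="tab:robustness"))
```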