An end-to-end educational workflow for understanding differential privacy in deep learning
Overview • Quick Start • Features • Project Structure • Documentation
📖 For a full conceptual walkthrough, see my accompanying Medium article:
Teaching Vision Models to Forget – The True Cost of Privacy in Deep Learning
This repository provides a complete, educational workflow for training image classifiers with differential privacy: from raw CelebA images through balanced subset creation, preprocessing, and baseline CNN training, to differentially private training with DP-SGD (via Opacus).
Design Philosophy: Instead of diving straight into opaque training loops, we prioritize clarity over complexity. Each stage is observable, testable, and modifiable, designed as a teaching tool.
The main entry point is `notebooks/celeba_eyeglasses_workflow.ipynb`.
This notebook provides an end-to-end educational workflow where you:
- Analyze the CelebA dataset and create balanced subsets
- Preprocess images with transparent, observable steps
- Train both baseline and DP-SGD models with matched hyperparameters
- Visualize and compare privacy-accuracy trade-offs
Note: All steps are designed for learning; each stage is observable, testable, and modifiable. Training can also be done programmatically using the modules in `src/training/`, but the notebook is recommended for understanding the complete workflow.
- Differential Privacy: Full DP-SGD implementation using Opacus
- Educational Focus: Transparent, well-documented code designed for learning
- Matched-Pair Methodology: Identical hyperparameters for baseline and DP-SGD enable direct privacy cost quantification (see the sketch after this list)
- Centralized Configuration: YAML-based config system for easy experimentation
- Comprehensive Visualization: Training curves, privacy-utility tradeoffs, and comparisons
- Tested & Reproducible: Full test suite and tracked experiments under `runs/`
- Automated Workflows: Hyperparameter sweeps for both baseline and DP-SGD
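The matched-pair idea is that both arms of the comparison share every architectural and optimization choice, so any accuracy gap is attributable to DP-SGD's clipping and noise rather than tuning differences. A minimal sketch of that setup; the names and values below are illustrative, not the repository's actual config schema:

```python
import torch
from torch import nn

def make_model(seed: int) -> nn.Module:
    # Identical architecture and initialization for both arms of the pair.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))

# Every hyperparameter is shared; only the privacy mechanism differs.
shared = {"seed": 42, "lr": 0.05, "batch_size": 128, "epochs": 10}
baseline_model = make_model(shared["seed"])
dp_model = make_model(shared["seed"])  # this copy is wrapped by Opacus in the DP run
```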
- Python 3.8+
- pip
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -U pip
pip install -e .[dev]
```

- Download the CelebA dataset from Kaggle
- Extract and place the archive under `data/celeba/archive/`
- The notebook will validate the expected files/structure automatically (a manual check is sketched below)
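If you want to sanity-check the layout before launching the notebook, a minimal sketch follows. It assumes the file names from the common Kaggle CelebA distribution; adjust if your archive differs:

```python
from pathlib import Path

# Expected contents of the Kaggle CelebA archive (names as on Kaggle;
# your download may differ slightly).
archive = Path("data/celeba/archive")
expected = [
    "img_align_celeba",         # aligned face images
    "list_attr_celeba.csv",     # 40 binary attributes, including Eyeglasses
    "list_eval_partition.csv",  # official train/val/test split
]
missing = [name for name in expected if not (archive / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing from {archive}: {missing}")
print("CelebA archive layout looks good.")
```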
The notebook uses a centralized YAML configuration system. Edit `notebooks/config.yaml` to customize:
- Dataset paths and subset sizes
- Preprocessing parameters
- Training hyperparameters
- Privacy parameters (for DP-SGD)
📖 See `notebooks/README_config.md` for detailed configuration documentation.
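Because the configuration is plain YAML, you can also inspect or script against it outside the notebook. A minimal sketch, assuming PyYAML is installed; the key names below are hypothetical placeholders, so consult `README_config.md` for the real schema:

```python
import yaml

# Load the centralized configuration used by the notebook.
with open("notebooks/config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys for illustration only; see README_config.md for the schema.
print(config.get("training", {}).get("learning_rate"))
print(config.get("privacy", {}).get("target_epsilon"))
```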
Open and run: `notebooks/celeba_eyeglasses_workflow.ipynb`
The notebook will:
- Build a balanced Eyeglasses subset (train/val/test)
- Preprocess images (crop/resize/normalize)
- Save dataset statistics needed by loaders
These artifacts are required before running training. We keep this step in-notebook to make the process transparent and learnable.
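For orientation, a deterministic crop/resize/normalize stage of this kind typically looks like the torchvision pipeline below. This is a minimal sketch: the crop size, target resolution, and channel statistics are illustrative, not the values the notebook computes and saves:

```python
from torchvision import transforms

# Deterministic preprocessing: no random augmentation, so every pass over
# the data produces identical tensors. Aligned CelebA images are 178x218.
preprocess = transforms.Compose([
    transforms.CenterCrop(178),  # square crop around the roughly centered face
    transforms.Resize(64),       # downsample to the training resolution
    transforms.ToTensor(),       # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```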
After preprocessing, you can train models by continuing in `notebooks/celeba_eyeglasses_workflow.ipynb`:
- The notebook includes training cells for both baseline and DP-SGD models
- It uses matched hyperparameters for a fair privacy-accuracy comparison (the DP-SGD step itself is sketched below)
- It includes visualization cells for analyzing results
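Under the hood, the DP-SGD step follows Opacus's standard pattern: wrap the model, optimizer, and data loader with a `PrivacyEngine`, then train as usual while per-sample gradients are clipped and noised. A minimal self-contained sketch with dummy data standing in for the preprocessed CelebA subset (model and hyperparameter values are illustrative):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data in place of the repository's CNN and CelebA loaders.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
optimizer = optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(
    TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,))),
    batch_size=32,
)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # noise scale relative to the clipping bound
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for x, y in loader:  # one epoch of the usual PyTorch loop
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

# Privacy spent so far, for a chosen delta.
print(f"epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```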
Click to expand project structure
```
dp_sgd_classification/
├── data/                  # Dataset storage (CelebA archive, subsets, processed images)
│   └── celeba/
│       ├── archive/       # Raw CelebA dataset
│       ├── subsets/       # Balanced subsets
│       └── processed/     # Preprocessed images
├── notebooks/             # Main workflow notebook, configs, and utilities
│   ├── celeba_eyeglasses_workflow.ipynb  # Primary entry point
│   ├── config.yaml        # Centralized configuration
│   └── README_config.md   # Configuration guide
├── scripts/               # Standalone CLI scripts for data processing
│   ├── celeba_analyze.py
│   ├── celeba_build_subset.py
│   ├── celeba_preprocess.py
│   └── celeba_centering.py
├── src/                   # Core Python package
│   ├── config/            # Configuration management
│   ├── core/              # Core ML components (models, data loaders, utils)
│   ├── datasets/          # Dataset-specific code (CelebA workflow, analysis)
│   ├── notebooks/         # Notebook utilities (display, setup, helpers)
│   ├── training/          # Training loops, sweeps, visualization
│   └── visualization/     # Plotting utilities
├── tests/                 # Test suite
└── runs/                  # Training run outputs (configs, checkpoints, metrics)
```
Click to expand testing instructions
Run the full test suite:
```bash
pytest -q
```

Or run specific test files:

```bash
pytest tests/test_data.py
pytest tests/test_train_baseline.py
pytest tests/test_train_dp.py
```

Click to expand detailed component descriptions
| Script | Description |
|---|---|
| `celeba_analyze.py` | Analyze CelebA attribute balance |
| `celeba_build_subset.py` | Create balanced subsets with stratification |
| `celeba_preprocess.py` | Preprocess images (crop/resize/normalize) |
| `celeba_centering.py` | Analyze face centering using landmarks |
- Configuration (`src/config/`): Centralized YAML-based config system with platform-specific workarounds (e.g., M1 Mac OpenMP fixes)
- Subset Building (`src/datasets/celeba/`): Balanced Eyeglasses vs. No Eyeglasses, with options to reduce confounding
- Preprocessing (`src/datasets/celeba/`): Deterministic center-crop/resize, normalization, and saved dataset stats
- Training (`src/training/`): Clear baseline (non-DP) and DP-SGD loops using PyTorch + Opacus
- Hyperparameter Sweeps (`src/training/`): Automated grid searches for baseline and DP-SGD (sketched after this list)
- Visualization (`src/visualization/`): Training curves, privacy-utility tradeoffs, and comparisons
- Notebook Utilities (`src/notebooks/`): Helper functions for timestamps, config printing, validation
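A grid sweep over the DP-relevant knobs is conceptually just a Cartesian product of settings, with one tracked run per combination. A minimal sketch; the grid values and run naming are illustrative, and the real sweep utilities in `src/training/` may expose a different interface:

```python
from itertools import product

# Illustrative search space over DP-SGD's main knobs.
grid = {
    "lr": [0.01, 0.05],
    "noise_multiplier": [0.8, 1.0, 1.2],
    "max_grad_norm": [0.5, 1.0],
}
for lr, sigma, clip in product(*grid.values()):
    run_id = f"lr{lr}_sigma{sigma}_clip{clip}"
    print(f"launching {run_id}")  # replace with the actual training entry point
```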
- ✅ Centralized configuration: YAML-based config system (`config.yaml`) for easy experimentation
- ✅ Helper utilities: Reusable functions (`generate_timestamp`, `print_config`, validation helpers)
- ✅ Cell dependencies: Clear documentation of global state and dependencies in the notebook header
- ✅ Code quality: Well-documented, maintainable code following best practices
- ✅ Matched-pair methodology: Identical hyperparameters for baseline and DP-SGD enable direct privacy cost quantification
- 📊 Reproducibility: Configs, metrics, and artifacts tracked under `runs/`
- 📚 Documentation: Comprehensive guides in `docs/` for data processing, training, and notebooks
Click to expand learning resources
If you're new to Differential Privacy, DP-SGD, or the CelebA dataset, the following resources provide helpful background:
- "The Algorithmic Foundations of Differential Privacy" (Dwork & Roth) β canonical introduction
π PDF
-
Original DP-SGD Paper: "Deep Learning with Differential Privacy"
π arXiv -
Opacus (PyTorch) β DP-SGD Documentation
π Website -
TensorFlow Privacy β DP-SGD Overview
π GitHub
-
CelebA Dataset Paper ("Deep Learning Face Attributesβ¦")
π arXiv -
CelebA Dataset Homepage & Documentation
π Website
-
Apple Differential Privacy Technical Overview
π PDF -
Google's RAPPOR (Local DP technique)
π Research Paper
Click to expand technology stack
- Python 3.8+
- PyTorch - Deep learning framework
- Opacus - Differential privacy for PyTorch
- NumPy, Pandas - Data processing
- Matplotlib, Seaborn - Visualization
- Pytest - Testing framework
Click to expand roadmap
- Additional privacy budget analysis tools
- Support for more CelebA attributes
- Extended visualization capabilities
- Performance optimizations
Click to expand acknowledgments
- CelebA Dataset - The Chinese University of Hong Kong
- Opacus - Facebook AI Research for differential privacy tools
Built with ❤️ for educational purposes