
A hands-on educational walkthrough of training a CelebA (Eyeglasses) image classifier with Differentially Private SGD using PyTorch and Opacus. The focus of this repo is on clarity and reproducibility through balanced subsets, deterministic preprocessing, and side-by-side baseline vs. DP training, while acknowledging real trade-offs.


carolinedotxyz/dp_sgd_classification


Status: WIP · Python · PyTorch · Medium Article

Educational SGD vs. DP-SGD Image Classifier

CelebA — Eyeglasses Attribute Classification

An end-to-end educational workflow for understanding differential privacy in deep learning

Overview • Quick Start • Features • Project Structure • Documentation

📖 For a full conceptual walkthrough, see my accompanying Medium article:
Teaching Vision Models to Forget — The True Cost of Privacy in Deep Learning


📖 Overview

This repository provides a complete, educational workflow for training image classifiers with differential privacy: from raw CelebA images through balanced subset creation and preprocessing, to baseline CNN training and differentially private training with DP-SGD (via Opacus).

Design Philosophy: Instead of diving straight into opaque training loops, we prioritize clarity over complexity. Each stage is observable, testable, and modifiable β€” designed as a teaching tool.

Figure: Training dynamics comparison, baseline SGD vs. DP-SGD

🎯 Primary Entry Point

The main entry point is notebooks/celeba_eyeglasses_workflow.ipynb

This notebook provides an end-to-end educational workflow where you:

  • Analyze the CelebA dataset and create balanced subsets
  • Preprocess images with transparent, observable steps
  • Train both baseline and DP-SGD models with matched hyperparameters
  • Visualize and compare privacy-accuracy trade-offs

Note: All steps are designed for learning. Each stage is observable, testable, and modifiable. Training can also be done programmatically using the modules in src/training/, but the notebook is recommended for understanding the complete workflow.


✨ Features

  • Differential Privacy: Full DP-SGD implementation using Opacus
  • Educational Focus: Transparent, well-documented code designed for learning
  • Matched-Pair Methodology: Identical hyperparameters for baseline and DP-SGD enable direct privacy cost quantification
  • Centralized Configuration: YAML-based config system for easy experimentation
  • Comprehensive Visualization: Training curves, privacy-utility tradeoffs, and comparisons
  • Tested & Reproducible: Full test suite and tracked experiments under runs/
  • Automated Workflows: Hyperparameter sweeps for both baseline and DP-SGD

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • pip

1️⃣ Installation

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -U pip
pip install -e .[dev]

2️⃣ Download the Data

  1. Download the CelebA dataset from Kaggle
  2. Extract and place the archive under:
    data/celeba/archive/
    
  3. The notebook will validate the expected files/structure automatically

3️⃣ Configuration (Optional)

The notebook uses a centralized YAML configuration system. Edit notebooks/config.yaml to customize:

  • Dataset paths and subset sizes
  • Preprocessing parameters
  • Training hyperparameters
  • Privacy parameters (for DP-SGD)

📖 See notebooks/README_config.md for detailed configuration documentation.
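As a rough mental model of what a centralized config like this holds, the sections above can be mirrored as nested dataclasses. The field names and defaults here are illustrative, not the repo's actual schema in notebooks/config.yaml:

```python
from dataclasses import dataclass, field

@dataclass
class DPConfig:
    noise_multiplier: float = 1.0  # sigma: Gaussian noise scale relative to the clip bound
    max_grad_norm: float = 1.0     # C: per-sample gradient clipping bound
    target_delta: float = 1e-5     # delta in the (epsilon, delta)-DP guarantee

@dataclass
class TrainConfig:
    batch_size: int = 128
    epochs: int = 10
    lr: float = 0.05

@dataclass
class Config:
    subset_size: int = 20000       # balanced subset size across splits (illustrative)
    image_size: int = 64           # preprocessing target resolution (illustrative)
    train: TrainConfig = field(default_factory=TrainConfig)
    dp: DPConfig = field(default_factory=DPConfig)

cfg = Config()
print(cfg.dp.noise_multiplier)  # → 1.0
```

Keeping training and privacy parameters in separate sections makes it easy to run the matched-pair comparison: the train block stays fixed while only the DP block varies.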

4️⃣ Create Subset and Preprocess

Open and run: notebooks/celeba_eyeglasses_workflow.ipynb

The notebook will:

  • Build a balanced Eyeglasses subset (train/val/test)
  • Preprocess images (crop/resize/normalize)
  • Save dataset statistics needed by loaders

These artifacts are required before running training. We keep this step in-notebook to make the process transparent and learnable.
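The deterministic crop/resize/normalize step can be pictured with plain NumPy. This is a sketch of the idea, not the repo's preprocessing code; the 128-pixel crop and the 0.5 mean/std stats are illustrative (aligned CelebA images are 178×218):

```python
import numpy as np

def center_crop(img: np.ndarray, size: int) -> np.ndarray:
    """Deterministic center crop of an HxWxC uint8 image."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def normalize(img: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Scale to [0, 1], then standardize with per-channel dataset stats."""
    x = img.astype(np.float32) / 255.0
    return (x - mean) / std

# Fake CelebA-sized image (218 high, 178 wide, RGB).
img = np.random.default_rng(0).integers(0, 256, (218, 178, 3), dtype=np.uint8)
cropped = center_crop(img, 128)
out = normalize(cropped,
                mean=np.array([0.5, 0.5, 0.5], dtype=np.float32),
                std=np.array([0.5, 0.5, 0.5], dtype=np.float32))
print(out.shape)  # → (128, 128, 3)
```

Because the crop is centered rather than random, re-running the pipeline yields byte-identical tensors, which is what makes the saved dataset statistics reusable across runs.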

5️⃣ Train Models

After preprocessing, continue in notebooks/celeba_eyeglasses_workflow.ipynb. The notebook includes:

  • Training cells for both baseline and DP-SGD models
  • Matched hyperparameters for a fair privacy-accuracy comparison
  • Visualization cells for analyzing results
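Opacus implements the DP-SGD update (per-sample gradient clipping plus Gaussian noise, as in Abadi et al.) behind per-sample gradient hooks. The mechanism itself is small enough to sketch in NumPy on a flat parameter vector; names and hyperparameter values here are illustrative:

```python
import numpy as np

def dp_sgd_step(per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat parameter vector.

    per_example_grads: (batch, dim) array of per-sample gradients.
    Each row is clipped to L2 norm <= clip_norm, the clipped gradients are
    averaged, and Gaussian noise with std noise_multiplier * clip_norm / batch
    is added to the mean before the usual SGD step.
    """
    batch, dim = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm / batch, size=dim)
    return -lr * (clipped.mean(axis=0) + noise)

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))  # toy per-sample gradients
update = dp_sgd_step(grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update.shape)  # → (10,)
```

The clipping bounds each example's influence on the update, and the noise hides the remainder; both together are what degrades accuracy relative to the matched baseline run.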

🏗️ Project Structure

dp_sgd_classification/
├── data/                    # Dataset storage (CelebA archive, subsets, processed images)
│   └── celeba/
│       ├── archive/         # Raw CelebA dataset
│       ├── subsets/         # Balanced subsets
│       └── processed/       # Preprocessed images
├── notebooks/               # Main workflow notebook, configs, and utilities
│   ├── celeba_eyeglasses_workflow.ipynb  # Primary entry point
│   ├── config.yaml          # Centralized configuration
│   └── README_config.md     # Configuration guide
├── scripts/                 # Standalone CLI scripts for data processing
│   ├── celeba_analyze.py
│   ├── celeba_build_subset.py
│   ├── celeba_preprocess.py
│   └── celeba_centering.py
├── src/                     # Core Python package
│   ├── config/              # Configuration management
│   ├── core/                # Core ML components (models, data loaders, utils)
│   ├── datasets/            # Dataset-specific code (CelebA workflow, analysis)
│   ├── notebooks/           # Notebook utilities (display, setup, helpers)
│   ├── training/            # Training loops, sweeps, visualization
│   └── visualization/       # Plotting utilities
├── tests/                   # Test suite
└── runs/                    # Training run outputs (configs, checkpoints, metrics)

🧪 Testing


Run the full test suite:

pytest -q

Or run specific test files:

pytest tests/test_data.py
pytest tests/test_train_baseline.py
pytest tests/test_train_dp.py

📦 What You'll Find


Data Processing Scripts (scripts/)

| Script | Description |
| --- | --- |
| celeba_analyze.py | Analyze CelebA attribute balance |
| celeba_build_subset.py | Create balanced subsets with stratification |
| celeba_preprocess.py | Preprocess images (crop/resize/normalize) |
| celeba_centering.py | Analyze face centering using landmarks |

Core Functionality (src/)

  • Configuration (src/config/): Centralized YAML-based config system with platform-specific workarounds (e.g., M1 Mac OpenMP fixes)
  • Subset Building (src/datasets/celeba/): Balanced Eyeglasses vs No Eyeglasses, with options to reduce confounding
  • Preprocessing (src/datasets/celeba/): Deterministic center-crop/resize, normalization, and saved dataset stats
  • Training (src/training/): Clear baseline (non-DP) and DP-SGD loops using PyTorch + Opacus
  • Hyperparameter Sweeps (src/training/): Automated grid searches for baseline and DP-SGD
  • Visualization (src/visualization/): Training curves, privacy-utility tradeoffs, and comparisons
  • Notebook Utilities (src/notebooks/): Helper functions for timestamps, config printing, validation
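The balanced subset building described above can be illustrated with a tiny stand-in (this is a sketch, not the API of src/datasets/celeba/): sample the same number of indices from each class so the binary task starts from a 50/50 prior.

```python
import random
from collections import defaultdict

def balanced_subset(labels, n_per_class, seed=0):
    """Sample n_per_class indices from each class, deterministically by seed."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    picked = []
    for y, idxs in sorted(by_class.items()):
        picked.extend(rng.sample(idxs, n_per_class))
    return picked

# Toy attribute vector: 1 = Eyeglasses, 0 = No Eyeglasses (imbalanced, as in CelebA).
labels = [1] * 100 + [0] * 900
subset = balanced_subset(labels, n_per_class=50)
print(len(subset))  # → 100
```

Fixing the seed keeps subset membership reproducible across runs, which is what allows the baseline and DP-SGD models to train on exactly the same images.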

Notebook Features (notebooks/)

  • ✅ Centralized configuration: YAML-based config system (config.yaml) for easy experimentation
  • ✅ Helper utilities: Reusable functions (generate_timestamp, print_config, validation helpers)
  • ✅ Cell dependencies: Clear documentation of global state and dependencies in the notebook header
  • ✅ Code quality: Well-documented, maintainable code following best practices
  • ✅ Matched-pair methodology: Identical hyperparameters for baseline and DP-SGD enable direct privacy cost quantification

Additional Features

  • 🔄 Reproducibility: Configs, metrics, and artifacts tracked under runs/
  • 📚 Documentation: Comprehensive guides in docs/ for data processing, training, and notebooks

🔗 Recommended Reading


If you're new to Differential Privacy, DP-SGD, or the CelebA dataset, the following resources provide helpful background:

Differential Privacy Fundamentals

  • "The Algorithmic Foundations of Differential Privacy" (Dwork & Roth) — canonical introduction
    📄 PDF
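To make the (ε, δ) bookkeeping concrete: the classical Gaussian mechanism analyzed in Dwork & Roth (Appendix A) calibrates noise as σ = Δ·√(2 ln(1.25/δ)) / ε for ε < 1, where Δ is the L2 sensitivity. DP-SGD uses much tighter accountants than this, but the formula shows the basic shape of the trade-off:

```python
import math

def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Noise scale for the classical Gaussian mechanism (Dwork & Roth,
    Appendix A), valid for 0 < epsilon < 1. Smaller epsilon (stronger
    privacy) demands proportionally more noise."""
    assert 0 < epsilon < 1
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

print(round(gaussian_sigma(0.5, 1e-5), 2))  # → 9.69
```

Halving ε doubles σ here, which is the simplest way to see why small privacy budgets cost accuracy.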

DP-SGD and Practical Implementations

CelebA Dataset Background

General DP Resources (Optional)


🛠️ Tech Stack

  • Python 3.8+
  • PyTorch - Deep learning framework
  • Opacus - Differential privacy for PyTorch
  • NumPy, Pandas - Data processing
  • Matplotlib, Seaborn - Visualization
  • Pytest - Testing framework

🗺️ Roadmap

  • Additional privacy budget analysis tools
  • Support for more CelebA attributes
  • Extended visualization capabilities
  • Performance optimizations

🙏 Acknowledgments

  • CelebA Dataset - The Chinese University of Hong Kong
  • Opacus - Facebook AI Research for differential privacy tools

Built with ❤️ for educational purposes
