Practical Problems & Projects
11-767: On-Device Machine Learning
Prof. Emma Strubell
Recognizing People in Photos
Through Private On-Device Machine
Learning
Floris Chabert, Jingwen Zhu, Brett Keating, and Vinay Sharma
Apple Research Blog July 2021
https://machinelearning.apple.com/research/recognizing-people-photos
Yonatan Bisk & Emma Strubell
Task
Find faces of contacts
Why is this hard?
Lighting, perspective,
skin color, age, gender, …
Why is this hard On-Device?
Motivation
On-device face recognition is privacy-preserving
Context:
• Competitors in the market (e.g. Google) use cloud-based services, so your data is shared.
• Apple has its own Neural Engine for acceleration.
• Quality vs. battery.
What is a naive algorithm we might use?
Inference Pipeline
Notes:
1. Two feature representations (2 × model)
2. Agglomerative Clustering (naively expensive)
3. Use of external metadata (can correct for a weak model)
Clustering
1. Conservative embedding clusters (very few merges - within moments?)
Relies on hand-tuned weighting for face (vs mean face) and body
$D_{ij} = \min(F_{ij},\ \alpha \cdot F_{ij} + \beta \cdot T_{ij})$ where F and T are face and body, respectively
2. Agglomerative Clustering (Faces only)
First pass (ideal): “median distance between the members of two HAC clusters”
After a threshold: “random sampling” ← maintains linear runtime (no guarantees; see the sketch below)
Note: clustering runs periodically, typically overnight while the device is charging
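As a rough illustration of the two clustering stages above, here is a minimal Python sketch; the cosine metric, α, β, and the merge threshold are illustrative assumptions, not the values used in the paper:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def combined_distance(F, T, alpha=0.5, beta=0.5):
    """Stage 1 distance: D_ij = min(F_ij, alpha*F_ij + beta*T_ij),
    where F and T are precomputed face and body distance matrices."""
    return np.minimum(F, alpha * F + beta * T)

def hac_median(face_emb, threshold=0.4):
    """Stage 2: agglomerative clustering on face embeddings only, merging the
    pair of clusters with the smallest median member-to-member distance until
    that distance exceeds the threshold. This naive version is far from linear
    time; the slide's "random sampling" trick caps how many pairs are compared."""
    D = squareform(pdist(face_emb, metric="cosine"))
    clusters = [[i] for i in range(len(face_emb))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.median(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge cluster b into cluster a
    return clusters
```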
Assigning Identity
• Every cluster has c “canonical” exemplars:
$D = [X^1_0, X^1_1, \ldots, X^1_c,\; X^2_0, X^2_1, \ldots, X^2_c,\; \ldots,\; X^K_0, \ldots, X^K_c]$
• Construct a representation for the input as a function of the dictionary (existing clusters): $\min_x \|y - D \cdot x\|_2^2 + \lambda \cdot \|x\|_1$
This reduces to a convex optimization (L1-regularized least squares) over the values in x
• So the representation is quickly learnable (optimally)
• Now the values in $x_j$ for a given $X^i_*$ define the cluster (see the sketch below)
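A minimal sketch of this convex-coding assignment using scikit-learn's Lasso, which solves exactly this L1-regularized least-squares objective; the dictionary layout, λ, and the scoring rule (assign to the cluster whose exemplars get the most coefficient mass) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def assign_identity(y, cluster_exemplars, lam=0.01):
    """y: (d,) embedding of a new face.
    cluster_exemplars: list of (n_k, d) arrays of canonical exemplars, one per
    existing cluster. Solves min_x ||y - D x||_2^2 + lam * ||x||_1 over the
    stacked dictionary and picks the cluster whose block of x has most mass."""
    D = np.concatenate(cluster_exemplars, axis=0)       # (sum_k n_k, d)
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(D.T, y)                                   # columns of D.T are exemplars
    x = lasso.coef_                                     # sparse coefficients
    bounds = np.cumsum([0] + [len(E) for E in cluster_exemplars])
    scores = [np.abs(x[bounds[k]:bounds[k + 1]]).sum()
              for k in range(len(cluster_exemplars))]
    return int(np.argmax(scores)), x
```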
Network Design
“highest accuracy possible while running efficiently on-device, with low latency and a thin memory profile”
• Skipping important details here because the model is largely based on MobileNet, which will be discussed later, BUT:
• Double channels “within limits of computation”
• Bottleneck expansions are smaller, and attention is added at every layer (see the sketch below)
• PReLU activations
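To make these bullets concrete, here is a hedged PyTorch sketch of a MobileNet-style inverted bottleneck with a modest expansion factor, a squeeze-and-excite attention block, and PReLU; the blog does not publish exact layer specs, so all sizes and the choice of SE attention are assumptions:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: adds very few parameters relative to the convolutions."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(ch, ch // reduction, 1)
        self.fc2 = nn.Conv2d(ch // reduction, ch, 1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)                  # global average pool
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s                                          # reweight channels

class InvertedBottleneck(nn.Module):
    """Inverted bottleneck with a smaller expansion (2x rather than 6x),
    attention at every layer, and PReLU activations."""
    def __init__(self, ch, expansion=2):
        super().__init__()
        hidden = ch * expansion
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.PReLU(hidden),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden), nn.PReLU(hidden),
            SqueezeExcite(hidden),
            nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)                              # residual connection
```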
Network Design
“highest accuracy possible while running efficiently on-device, with low latency and a thin memory profile”
• Wider networks give roughly the same performance as deeper ones (but are faster)
Zagoruyko, S., Komodakis, N.: Wide Residual Networks
• Attention adds performance with little to no new parameters
Performance of Attention
Training (focus on normalization and cosine similarity)
A margin ensures extra weight on hard examples (see the sketch below)
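The normalization/cosine/margin recipe is typically implemented as an additive cosine-margin softmax (CosFace/ArcFace style); the blog does not give the exact loss, so the scale and margin below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineMarginLoss(nn.Module):
    """L2-normalize embeddings and class weights, subtract a margin from the
    target-class cosine, then scale: the model must beat the other classes by
    at least the margin, which concentrates the loss on hard examples."""
    def __init__(self, dim, num_classes, scale=30.0, margin=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.scale, self.margin = scale, margin

    def forward(self, emb, labels):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))  # (B, C) cosines
        target = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.scale * (cos - self.margin * target)          # margin on target class only
        return F.cross_entropy(logits, labels)
```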
Other Considerations
1. Filtering Unclear Faces (no details)
2. Augmentations: “pixel-level changes such as color jitter or grayscale conversion, structural changes like left-right flipping or distortion, Gaussian blur, random compression artifacts and cutout regularization” (see the sketch below)
3. COVID-19: “we designed a synthetic mask augmentation. We used face landmarks to generate a realistic shape corresponding to a face mask. We then overlaid random samples from clothing and other textures in the inferred mask area over the input face”
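A sketch of the quoted augmentations with torchvision; parameters are guesses, cutout is approximated with RandomErasing, and the compression-artifact and synthetic-mask augmentations (which need face landmarks) are omitted:

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.RandomGrayscale(p=0.1),                    # grayscale conversion
    transforms.RandomHorizontalFlip(p=0.5),               # left-right flipping
    transforms.RandomPerspective(distortion_scale=0.2, p=0.3),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.2)),  # cutout-style regularization
])
```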
Qualitative
Key Components
• Optimized clustering (constant time)
• Assignment via Convex coding (minimal updates)
• Wider (shallower) networks
Questions:
1. Was the attention worth it?
2. Was this only possible because of the neural engine?
Course Project
Anatomy of the Course Project
We provide:
• Lab 2: Benchmarking
• Lab 3: Quantization
• Lab 4: Pruning
You decide:
• Hardware: Laptop, robot, RPi…
• Model: ResNet, Transformer, encoder vs. decoder…
• Data: Language, vision, … Same as training data, or a transfer/adaptation setting?
Example Projects
• AnySurface: Converting any surface into a controller by compressing a UNet, running on RPi4.
• Speech-to-text translation: Automatic speech recognition and translation on RPi4.
• Im2Cal: Estimating food calories from images by compressing SegFormer, on RPi4.
• Hey where’s that thing: Temporal localization in
videos by compressing 2D-TAN on laptop.
• Shazaam: On-device music recognition w/
FAISS, separable convolutions.
Example Projects
• Plant Jones: Smart assistant, who is also a plant.
• v1.0 (2015): Find tweets with positive/negative sentiment about water; post positive-sentiment ones when well watered, negative-sentiment ones when thirsty (dry).
• v2.0 (2023): Use an LLM to generate thirst-related conversation. Also:
— Custom wake-word detection (“hey plant!”)
— Text-to-speech
— Speech-to-text
— Tiny LCD screen mouth
• This is an example baseline using open-source software and libraries, implemented with out-of-the-box tools over about a week.
Axes to Consider
• Theory or practice? Resource optimized vs resource constrained?
• Target hardware:
CPU + RAM vs GPU/M1 + Shared RAM vs GPU+CPU + Separate RAM
• Hardware support: Logic, quantization, sparse ops, batching…
• Novelty: Reproduction vs transfer (new data/hardware) vs novel?
• In-distribution or transfer: Fitting to in-distribution data, vs. adapting to a new
task or domain?
Efficiency in Theory versus Practice
Resource Optimized:
• Magnitude pruning
• Server
• Quantization (3-bit)
Resource Constrained:
• Structured pruning / layer pruning
• Edge device
• Quantization (8-bit), if hardware supports it (see the sketch below)
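As one concrete instance from the resource-constrained column, PyTorch post-training dynamic quantization converts Linear layers to int8 on CPU; whether it actually speeds anything up depends on the hardware's int8 kernels (the toy model is a placeholder):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for whatever you actually benchmark.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Post-training dynamic quantization: int8 weights, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same outputs, smaller/faster Linear layers on supported CPUs
```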
Target hardware considerations
In addition to devices, we can provide: $100 AWS and $50 OpenAI credits
per student.
• Where do you store model weights, activations, gradients?
How does this impact latency?
• Trade-off between storage size, speed, and on-the-fly computation
• Do I want on-device training? Fine-tuning?
• How heavy is the OS? How heavy are USB vs GPIO?
• Does your hardware support efficient batched computation? Efficient low-bitwidth computation? Efficient control flow? (See the timing sketch below.)
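A crude timing sketch for the batching question above; the model, sizes, and iteration counts are placeholders, and a real benchmark (Lab 2) would also control for thermal state and power:

```python
import time
import torch
import torch.nn as nn

def latency_per_example(model, batch_size, d=512, n_iters=50):
    """Rough CPU wall-clock latency per example; if this barely drops as the
    batch grows, batched computation is not being exploited by the hardware."""
    x = torch.randn(batch_size, d)
    with torch.no_grad():
        model(x)                                  # warm-up
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / (n_iters * batch_size)

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()
for bs in (1, 8, 64):
    print(f"batch {bs:3d}: {latency_per_example(model, bs) * 1e6:.1f} us/example")
```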
Project Ideas
Resource Optimized
• Does efficient method X published in a CV venue apply to NLP, or vice versa?
• Does theoretically proven idea Y published in ML venue apply to larger, more
complex models and datasets?
Resource Constrained
• Does “efficient” method Z, evaluated on GPU/TPU, work on CPU/Edge? Under memory constraints? Power constraints?
• Can you further optimize an already-efficient model?
Can you compress a huge model enough to fit it on device?
All of the above
• Compare existing methods across different metrics: Pareto optimality,
generalization, fairness, …
Learning Goals
Project is not:
• Entrepreneurship 101
• Multimodal Machine Learning (amazing class)
• Graded based on model performance
• Real world robotics
Project is:
• Measuring Efficiency and Power
• Adjusting data for 👆
• Changing architectures for 👆
• Producing Pareto curves for 👆
Plotting Goals
Where to start?
• What pre-trained models exist for my task?
• What is a baseline I can feasibly train/evaluate in a few hours?
• How can I sub-sample my data to create a feasible train/test set? (See the sketch at the end.)
• Single domain? Limited label space? Simplified task?
• Goal: performance that is non-trivial; it does not need to be competitive
• What is unique about my data/task/… that makes me think I can
compress my models?
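One hedged sketch of sub-sampling to a limited label space with a cap per class (function and parameter names are placeholders):

```python
import random
from collections import defaultdict

def subsample(examples, keep_labels, per_class=200, seed=0):
    """examples: iterable of (x, label) pairs. Keep only the chosen classes and
    at most `per_class` examples of each, so baselines train in hours, not days."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in examples:
        if y in keep_labels:
            by_label[y].append((x, y))
    subset = []
    for items in by_label.values():
        rng.shuffle(items)
        subset.extend(items[:per_class])
    rng.shuffle(subset)
    return subset
```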