FSDL 2022
Deployment
Josh Tobin
SEPTEMBER 5, 2022
[Course map: Data (sources, processing, exploration, versioning, labeling), Development (datasets, frameworks & distributed training, experiment and model management, resource management, software engineering, CI / testing), Deployment (web, edge, feature store, monitoring), plus “all-in-one” platforms]
Only IRL do you see how your model actually works
• Deploy early, deploy often
• Keep it simple, and add complexity later
  - Build a prototype
  - Separate your model and UI
  - Learn the tricks to scale
  - Consider moving your model to the edge when you *really* need to go fast ⚡
Step 1: build a prototype you and your friends / teammates can interact with
Tools for prototype deployment
Prototype deployment: best practices
• Have a basic UI
  - Easier for other folks to try it and give feedback
  - Gradio & Streamlit are your friends here
• Put it behind a web URL
  - Easier to share
  - Cloud versions of Streamlit and Hugging Face are helpful here
• Don’t stress too much
Where will this fail?
• Limited frontend flexibility
• They don’t scale to many concurrent requests: the model becomes the bottleneck
Where in the architecture should your model go?
[Diagram: a client (local) sends a request to a server (remote) backed by a database; the server returns a response]
Model-in-service
[Diagram: the same architecture, with the model running inside the web server]
Model-in-service
Pros:
• Re-uses your existing infrastructure
Cons:
• Web server may be written in a different language
• Models may change more frequently than server code
• Large models can eat into the resources for your web server
• Server hardware not optimized for your model (e.g., no GPUs)
• Model & server may scale differently
Step 2: separate your model from your UI
Option 1: Batch prediction
[Diagram: the model runs offline and writes predictions to the database; the server reads stored predictions to answer client requests]
Batch prediction
• Periodically run your model on new data and save the results in a database
• Works if the universe of inputs is relatively small (e.g., 1 prediction per user, per client, etc.)
  - Recommender systems
  - Marketing automation (e.g., lead segmentation)
Data processing / workflow tools work well here
• Re-run preprocessing
• Load the model
• Run predictions
• Store predictions
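As a rough illustration of that loop, here is a minimal sketch of a scheduled batch job; the schema and helper names are assumptions, not from the lecture:

```python
"""Minimal sketch of a batch prediction job (schema and names are illustrative).
Run it on a schedule with cron, Airflow, Prefect, etc."""
import sqlite3

def run_batch_job(db_path: str, predict) -> None:
    conn = sqlite3.connect(db_path)
    # Fetch inputs that don't have a stored prediction yet (re-run any
    # preprocessing upstream of this query as needed).
    rows = conn.execute(
        "SELECT id, features FROM inputs "
        "WHERE id NOT IN (SELECT id FROM predictions)"
    ).fetchall()
    # Load the model (passed in here as a callable) and run predictions.
    results = [(row_id, predict(features)) for row_id, features in rows]
    # Store predictions where the serving path can look them up.
    conn.executemany("INSERT INTO predictions (id, score) VALUES (?, ?)", results)
    conn.commit()
    conn.close()
```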
Batch prediction
Pros:
• Simple to implement
• Scales easily
• Used in production by large-scale production systems for years
• Fast to retrieve the prediction
Cons:
• Doesn’t scale to complex input types (user-specified queries, etc.)
• Users don’t get the most up-to-date predictions
• Models frequently become “stale”, which can be hard to detect
Model-as-service
[Diagram: the model runs as its own service alongside the server and database; the server calls it over the network]
Model-as-service
• Run your model on its own web server
• The backend (or the client itself) interacts with the model by making requests to the model service and receiving responses back
Model-as-service
Pros:
• Dependability — model bugs less likely to crash the web app
• Scalability — choose optimal hardware for the model and scale it appropriately
• Flexibility — easily reuse a model across multiple apps
Cons:
• Can add latency
• Adds infrastructural complexity
• Now you have to run a model service…
Sweet spot for most ML-powered products!
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Rollout
• Managed options
REST APIs
• Serving predictions in response to canonically-formatted HTTP requests
• There are alternatives like gRPC (which is actually used in TensorFlow Serving) and GraphQL (not terribly relevant to model services)
REST API example
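As a concrete sketch, a minimal prediction endpoint using FastAPI might look like the following (the framework choice, route name, and request schema are assumptions, not necessarily the deck's example):

```python
"""A minimal REST prediction endpoint, sketched with FastAPI."""
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

def model_predict(text: str) -> float:
    # Stand-in for real inference (e.g., a loaded PyTorch model).
    return float(len(text) % 2)

@app.post("/predict")
def predict(req: PredictRequest):
    # Parse the JSON body, run the model, return a JSON response.
    return {"prediction": model_predict(req.text)}

# Serve with (assuming this file is main.py):
#   uvicorn main:app --host 0.0.0.0 --port 8000
```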
Formats for requests and responses
• Sadly, no standard yet: Google Cloud, Azure, and AWS SageMaker each use their own format
Dependency management for model servers
• Model predictions depend on code, model weights, and dependencies. All need to be present on your web server
• Dependencies cause trouble:
  - Hard to make consistent
  - Hard to update
  - Even changing a TensorFlow version can change your model
• Two strategies:
  - Constrain the dependencies for your model
  - Use containers
Constraining model dependencies
A standard neural net format: ONNX
• The promise: define a network in any language, run it consistently anywhere
• The reality: since the libraries change quickly, there are often bugs in the translation layer
• What about non-library code like feature transformations?
https://github.com/sayakpaul/ml-deployment-k8s-fastapi/
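A minimal sketch of the promise in practice: export a PyTorch model to ONNX and run it with onnxruntime (the tiny model and input names here are illustrative):

```python
"""Sketch: export a PyTorch model to ONNX, then run it with onnxruntime."""
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Linear(4, 2).eval()  # stand-in for a real model
dummy = torch.randn(1, 4)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# The exported graph runs anywhere onnxruntime is available, no PyTorch needed.
session = ort.InferenceSession("model.onnx")
(out,) = session.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(out.shape)  # (1, 2)
```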
Containers
Managing dependencies with containers (i.e., Docker)
• Docker vs VM
• Dockerfile and layer-based images
• DockerHub and the ecosystem
• Wrappers around Docker for ML
No OS -> lightweight
https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b
Lightweight -> heavy use
• Spin up a container for every discrete task
• For example, a web app might have four containers:
  - Web server
  - Database
  - Job queue
  - Worker
https://www.docker.com/what-container
Dockerfile
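The deck shows a Dockerfile on this slide; a representative one for a Python model service might look like this (base image, port, and commands are assumptions):

```dockerfile
# Illustrative Dockerfile for a Python model service.
FROM python:3.9-slim
WORKDIR /app

# Install pinned dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code (and model weights, if baked into the image).
COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```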
Strong ecosystem
• Images are easy to find, modify, and contribute back to DockerHub
• Private images easy to store in the same place
https://docs.docker.com/engine/docker-overview
Docker is incredibly popular: near ubiquitous
https://www.docker.com/what-container#/package_software
This seems hard, can’t we simplify it?
Making inference on a single machine more efficient
• GPU or no GPU?
• Concurrency
• Model distillation
• Quantization
• Caching
• Batching
• Sharing the GPU
• Libraries
GPU or no GPU?
• GPU pros
  - Probably the same hardware you trained on
  - In the limit of model size, batch size tuning, etc., usually higher throughput
• GPU cons
  - More complex to set up
  - Often more expensive
Just because your model was trained on a GPU, it does not mean you need to serve it on a GPU
Concurrency
• What?
  - Multiple copies of the model running on different CPUs or cores
• How?
  - Be careful about thread tuning
https://blog.roblox.com/2020/05/scaled-bert-serve-1-billion-daily-requests-cpus/
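A sketch of the thread-tuning idea (the exact settings are assumptions; see the Roblox post above for a production account): cap intra-op threads per worker, then scale out with multiple worker processes:

```python
"""Sketch: one inference thread per worker process, many worker processes."""
import torch

# Without this, each worker spawns many intra-op threads that fight over the
# same cores; one thread per process often yields better total throughput.
torch.set_num_threads(1)

model = torch.nn.Linear(4, 2).eval()  # stand-in model

def predict(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model(x)

# Then run N copies of the server process, e.g.:
#   gunicorn -w 8 -k uvicorn.workers.UvicornWorker main:app
```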
Model distillation
• What?
  - Train a smaller model to imitate your larger one
• How?
  - Several techniques outlined below
  - Can be finicky to do yourself — infrequently used in practice
  - Exception — pretrained distilled models like DistilBERT
https://heartbeat.fritz.ai/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb
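For reference, the standard distillation objective (Hinton et al.) combines a temperature-softened teacher-matching term with ordinary cross-entropy; a minimal sketch:

```python
"""Sketch of the classic knowledge distillation loss."""
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL divergence between temperature-softened distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy against the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# The teacher runs in eval mode under torch.no_grad(); only the student trains.
```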
Quantization
• What?
  - Execute some or all of the operations in your model with a smaller numerical representation than floats (e.g., INT8)
  - Some tradeoffs with accuracy
• How?
  - PyTorch and TensorFlow Lite have quantization built in
  - Can also run quantization-aware training, which often results in higher accuracy
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
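The built-in PyTorch path mentioned above is nearly a one-liner for post-training dynamic quantization (the model and layer choice here are illustrative):

```python
"""Sketch: post-training dynamic quantization in PyTorch (weights stored as
INT8; activations quantized on the fly at inference time)."""
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface; smaller and often faster on CPU
```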
Quantization: tools
Caching
• What?
  - For some ML models, some inputs are more common than others
  - Instead of always calling the model, first check the cache
• How?
  - Can get very fancy
  - Basic way uses functools
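The basic functools version looks like this (inputs must be hashable, so tensors would need to be converted to a hashable key first):

```python
"""Sketch: basic prediction caching with functools.lru_cache."""
from functools import lru_cache

@lru_cache(maxsize=1024)
def predict(text: str) -> float:
    # Stand-in for an expensive model call; repeated inputs hit the cache.
    return float(len(text))

predict("hello")  # computed
predict("hello")  # served from the cache
print(predict.cache_info())  # hits=1, misses=1, ...
```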
Batching
• What?
  - ML models often achieve higher throughput when doing prediction in parallel, especially on a GPU
• How?
  - Gather predictions until you have a batch, run prediction, return to user
  - Batch size needs to be tuned
  - You need a way to shortcut the process if latency becomes too long
  - Probably don’t want to implement this yourself (see the toy sketch below)
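To make the mechanics concrete, here is a toy sketch of the gather-then-predict loop with a latency shortcut; real serving systems implement this far more robustly (e.g., dynamic batching in serving frameworks), so treat it as illustration only:

```python
"""Toy sketch of server-side batching with a latency deadline."""
import queue
import time

def batch_worker(requests: queue.Queue, predict_batch, max_batch=8, max_wait_s=0.01):
    while True:
        batch = [requests.get()]  # block until the first request arrives
        deadline = time.monotonic() + max_wait_s
        # Gather more requests until the batch is full or the deadline hits.
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        # One batched model call, then hand each caller its own result.
        outputs = predict_batch([item["input"] for item in batch])
        for item, output in zip(batch, outputs):
            item["reply"].put(output)
```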
Sharing the GPU
• What?
  - Your model may not take up all of the GPU memory with your inference batch size. Why not run multiple models on the same GPU?
• How?
  - You’ll probably want to use a model serving solution that supports this out of the box
Model serving libraries
[Logos: e.g., TensorFlow Serving, NVIDIA Triton, TorchServe]
Horizontal scaling
• What?
  - If you have too much traffic for a single machine, split traffic among multiple machines
• How?
  - Spin up multiple copies of your service and split traffic using a load balancer
  - In practice, two common methods:
    • Container orchestration (i.e., Kubernetes)
    • Serverless (e.g., AWS Lambda)
Container orchestration
Frameworks for ML deployment on Kubernetes
Deploying code as serverless functions
• App code and dependencies are packaged into .zip files or Docker containers with a single entry point function
• AWS Lambda (or Google Cloud Functions, or Azure Functions) manages everything else: instant scaling to 10,000+ requests per second, load balancing, etc.
• Only pay for compute time
Start here!
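A minimal handler sketch (the event shape assumes an API Gateway-style JSON body, and the model loader is a stand-in): load the model once per container so warm invocations skip the cost:

```python
"""Sketch of an AWS Lambda inference handler."""
import json

MODEL = None  # cached across warm invocations of the same container

def _load_model():
    # Stand-in: load real weights from the deployment package or S3.
    return lambda text: float(len(text))

def handler(event, context):
    global MODEL
    if MODEL is None:  # only paid for on cold starts
        MODEL = _load_model()
    body = json.loads(event["body"])
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": MODEL(body["text"])}),
    }
```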
Deploying code as serverless functions
• Cons:
  - Limited size of deployment package
  - Cold start
  - Can be challenging to build pipelines of models
  - Little to no state management (e.g., for caching)
  - Limited deployment tooling
  - CPU-only, limited execution time
How far away are we from serverless GPUs?
Model rollouts
• What?
  - If serving is how you turn a model into something that can respond to requests, rollouts are how you manage and update these services
• How?
  - You probably want to be able to roll out gradually, roll back instantly, split traffic between versions, and deploy pipelines of models
  - This is a challenging infra problem, and beyond the scope of this lecture
  - Your deployment library (or infra team) may take care of this for you
Managed options
• Cloud providers
• End-to-end ML platforms
• Startups
Double-click on SageMaker
It’s easy, but more expensive
• ~50-100% more expensive than raw EC2, depending on instance type
• Serverless is a better deal: ~20% more expensive than Lambda
Building a model service: takeaways
• If you are doing CPU inference, you can get away with scaling by launching more servers, or going serverless
• Serverless makes sense if you can get away with CPUs and traffic is spiky or low-volume
• SageMaker is a perfectly good way to get started if you’re on AWS, but can get expensive
• If using GPU inference, serving tools like TF Serving, Triton, and TorchServe will save you time
• Worth keeping an eye on the startups in this space for GPU inference
Step 3: move to the edge?
When should you consider edge deployment?
• Sometimes, it’s obvious:
  - No reliable internet connection
  - Very strict data security / privacy requirements
• Otherwise:
  - Accuracy and latency both affect end-user experience
  - Latency includes the network round trip and model prediction time
  - Once you’ve exhausted options for reducing model prediction time, consider edge
Edge prediction
[Diagram: the model runs on the client device; prediction no longer requires a round trip to the server]
Edge prediction
• Send model weights to the client device
• Client loads the model and interacts with it directly
Edge deployment
Pros:
• Lowest latency
• Does not require an internet connection
• Data security — data doesn’t need to leave the user’s device
• Scale comes “for free”
Cons:
• Often limited hardware resources available
• Embedded and mobile frameworks are less full-featured than TensorFlow / PyTorch
• Difficult to update models
• Difficult to monitor and debug when things go wrong
Frameworks
Tools for edge deployment
TensorRT: model optimizer and inference runtime for TensorFlow + NVIDIA devices
Tools for edge deployment
CoreML:
  - Released at Apple WWDC 2017
  - Inference only
  - https://coreml.store/
ML Kit:
  - Announced at Google I/O 2018
  - Either via API or on-device
  - Offers pre-trained models, or can upload a TensorFlow Lite model
Tools for edge deployment
PyTorch Mobile: PyTorch on iOS and Android
Tools for edge deployment
TFLite: TensorFlow on mobile / edge devices
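The conversion path is short; a sketch with a stand-in Keras model (names and sizes illustrative):

```python
"""Sketch: convert a Keras model to TFLite and run it with the interpreter."""
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(4,))])
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()  # optionally enable quantization here

# The flatbuffer runs on-device via the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], np.random.randn(1, 4).astype(np.float32))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)  # (1, 2)
```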
Tools for edge deployment
TensorFlow.js: TensorFlow in the browser
Tools for edge deployment
Apache TVM: library-agnostic and target-device-agnostic inference runtime
Watch this space
MLIR

Efficiency
More efficient models
• Quantization and distillation from above
• Mobile-friendly model architectures
MobileNets
[Figures: a standard convolution block vs. the MobileNet depthwise separable block (3x3 depthwise + 1x1 pointwise convolutions)]
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
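The core trick is the depthwise separable convolution; a sketch of the block in PyTorch (layer sizes illustrative):

```python
"""Sketch: the depthwise separable convolution behind MobileNets."""
import torch.nn as nn

def separable_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        # Pointwise: a 1x1 conv that mixes channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

# Cost per output pixel: ~(9*in + in*out) multiply-adds, vs. 9*in*out for a
# standard 3x3 convolution: roughly an 8-9x saving at typical channel counts.
```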
Recommended case study: DistilBERT
https://medium.com/huggingface/distilbert-8cf3380435b5
Mindsets for edge deployment
• Choose your architecture with your target hardware in mind
  - You can make up a factor of 2-10 through distillation, quantization, and other tricks, but not more than that
• Once you have a model that works on your edge device, you can iterate locally as long as you add model size and latency to your metrics and avoid regressions
• Treat tuning the model for your device as an additional risk in the deployment cycle and test it accordingly
  - E.g., always test your models on production hardware before deploying
• Since models can be finicky, it’s a good idea to build fallback mechanisms into the application in case the model fails or is too slow
Edge deployment: conclusion
• Web deployment is easier, so use it unless you really need to be at the edge
• Choose your framework to match the available hardware and corresponding mobile frameworks, or try TVM to be more flexible
• Start considering hardware constraints at the beginning of the project and choose architectures accordingly
Only IRL do you see how your model actually works
• Deploy early, deploy often
• Keep it simple, and add complexity later
  - Build a prototype
  - Separate your model and UI
  - Learn the tricks to scale
  - Consider moving your model to the edge when you *really* need to go fast ⚡