FSDL 2022 Lecture 5: Deployment

The document outlines best practices for deploying machine learning models, emphasizing the importance of early and simple deployment, and separating model and UI components. It discusses various deployment strategies, including prototype deployment, batch prediction, and model-as-service, along with their pros and cons. Additionally, it covers technical aspects such as REST APIs, dependency management, performance optimization, and the use of containers for efficient model serving.


FSDL 2022

Deployment
Josh Tobin

SEPTEMBER 5, 2022

[Diagram: the course's ML infrastructure landscape: data (sources, compute, processing, exploration, versioning, labeling), development (software engineering, frameworks & distributed training, resource management, experiment and model management), and deployment (CI / testing, edge, web, feature store, monitoring), plus "all-in-one" platforms. This lecture covers deployment.]


FSDL 20223
FSDL 20224
FSDL 2022

Only IRL do you see how your model actually works

• Deploy early, deploy often

• Keep it simple, and add complexity later

- Build a prototype

- Separate your model and UI

- Learn the tricks to scale

- Consider moving your model to the edge when you *really* need
to go fast ⚡


Step 1: build a prototype you and your friends / teammates can interact with

Tools for prototype deployment


Prototype deployment: best practices


• Have a basic UI

- Easier for other folks to try it and give feedback

- Gradio & Streamlit are your friends here

• Put it behind a web URL

- Easier to share

- Cloud versions of Streamlit and Hugging Face are helpful here

• Don’t stress too much
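
To make this concrete, here is a minimal Gradio sketch; the predict function is a hypothetical stand-in for your model:

    import gradio as gr

    def predict(text: str) -> str:
        # Hypothetical stand-in for real model inference
        return text.upper()

    # Wires the function to a simple web UI; share=True also serves it
    # behind a temporary public URL you can send to teammates
    gr.Interface(fn=predict, inputs="text", outputs="text").launch(share=True)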


Where will this fail?

• Limited frontend flexibility

• They don't scale to many concurrent requests: the model becomes the bottleneck

Where in the architecture should your model go?



[Diagram: a client (local) sends requests to a server (remote) backed by a database and receives responses.]

Model-in-service

[Diagram: the same architecture, with the model embedded in the web server process.]

Model-in-service

Pros:

• Re-uses your existing infrastructure

Cons:

• Web server may be written in a different language

• Models may change more frequently than server code

• Large models can eat into the resources for your web server

• Server hardware not optimized for your model (e.g., no GPUs)

• Model & server may scale differently

Step 2: separate your model from your UI

Option 1: Batch prediction



[Diagram: the model runs offline and writes predictions to the database; client requests are served from those stored predictions.]

Batch prediction

• Periodically run your model on new data and save the results in a database

• Works if the universe of inputs is relatively small (e.g., 1 prediction per user, per client, etc.)

- Recommender systems

- Marketing automation (e.g., lead segmentation)

Data processing / workflow tools work well here

• Re-run preprocessing

• Load the model

• Run predictions

• Store predictions
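
A minimal sketch of such a batch job, as a plain Python script you might schedule with cron or a workflow tool like Airflow; the file names, columns, and sklearn-style model are hypothetical stand-ins:

    import joblib
    import pandas as pd

    # Load new data and re-run the same preprocessing used in training
    df = pd.read_csv("new_users.csv")            # hypothetical input
    features = df[["age", "num_visits"]]         # hypothetical preprocessing

    # Load the model and run predictions
    model = joblib.load("model.joblib")          # hypothetical sklearn-style model
    df["prediction"] = model.predict(features)

    # Store predictions where the app can read them
    df[["user_id", "prediction"]].to_csv("predictions.csv", index=False)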

Batch prediction

Pros:

• Simple to implement

• Scales easily

• Used in production by large-scale production systems for years

• Fast to retrieve the prediction

Cons:

• Doesn't scale to complex input types (user-specified queries, etc.)

• Users don't get the most up-to-date predictions

• Models frequently become "stale", which can be hard to detect

Model-as-service

[Diagram: the model runs on its own server; the web server (or the client) calls the model service over the network.]

Model-as-service

• Run your model on its own web server

• The backend (or the client itself) interacts with the model by making requests to the model service and receiving responses back

Model-as-service

Pros:

• Dependability — model bugs less likely to crash the web app

• Scalability — choose optimal hardware for the model and scale it appropriately

• Flexibility — easily reuse a model across multiple apps

Cons:

• Can add latency

• Adds infrastructural complexity

• Now you have to run a model service…

Sweet spot for most ML-powered products!

Building a model service: the basics

• REST APIs

• Dependency management

• Performance optimization

• Horizontal scaling

• Rollout

• Managed options


REST APIs

• Serving predictions in response to canonically-formatted HTTP requests

• There are alternatives like gRPC (which is actually used in TensorFlow Serving) and GraphQL (not terribly relevant to model services)

REST API example
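
The original slide shows a code screenshot; as a stand-in, here is a minimal sketch of such an endpoint using FastAPI, where the request schema and the scoring logic are hypothetical placeholders for a real model:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class PredictRequest(BaseModel):
        text: str  # hypothetical input schema

    @app.post("/predict")
    def predict(req: PredictRequest):
        # Hypothetical stand-in for real model inference
        score = float(len(req.text) % 10) / 10
        return {"prediction": score}

Run with "uvicorn main:app", then POST JSON such as {"text": "hello"} to /predict.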


Formats for requests and responses

• Sadly, no standard yet: Google Cloud, Azure, and AWS Sagemaker each use their own format


Dependency management for model servers

• Model predictions depend on code, model weights, and dependencies. All need to be present on your web server

• Dependencies cause trouble:

- Hard to make consistent

- Hard to update

- Even changing a TensorFlow version can change your model

• Two strategies:

- Constrain the dependencies for your model

- Use containers
Constraining model dependencies
A standard neural net format: ONNX

• The promise: define network in any language, run it consistently anywhere

• The reality: since the libraries change quickly, there are often bugs in the translation layer

• What about non-library code like feature transformations?

https://github.com/sayakpaul/ml-deployment-k8s-fastapi/
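A minimal sketch of the ONNX round trip for a PyTorch model; the two-layer model here is a hypothetical example:

    import torch
    import onnxruntime as ort

    # Hypothetical model; any torch.nn.Module works the same way
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
    )
    model.eval()

    # Export: trace the model with a dummy input and write a .onnx file
    dummy = torch.randn(1, 4)
    torch.onnx.export(model, dummy, "model.onnx")

    # Load and run with ONNX Runtime, with no PyTorch dependency at serving time
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    print(session.run(None, {input_name: dummy.numpy()}))

Note the slide's caveat still applies: feature transformations that live outside the network graph are not captured by the export and have to ship separately.
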
Containers

Managing dependencies with containers (i.e., Docker)

• Docker vs VM

• Dockerfile and layer-based images

• DockerHub and the ecosystem

• Wrappers around Docker for ML

No OS -> lightweight

https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b

Lightweight -> heavy use

• Spin up a container for every discrete task

• For example, a web app might have four containers:

- Web server

- Database

- Job queue

- Worker

https://www.docker.com/what-container

Dockerfile
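The original slide shows an example Dockerfile; as a stand-in, here is a minimal sketch of one for a Python model service, where the base image, file names, and serving command are all assumptions:

    # A minimal sketch, not a hardened production image
    FROM python:3.10-slim

    WORKDIR /app

    # Install pinned dependencies first so this layer is cached
    # across code-only changes
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy application code and model weights into the image
    COPY . .

    # Serve a (hypothetical) FastAPI app defined in main.py
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Each instruction creates a layer, which is what makes images cheap to rebuild and share.
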
Strong Ecosystem

• Images are easy to find, modify, and contribute back to DockerHub

• Private images are easy to store in the same place

https://docs.docker.com/engine/docker-overview

Docker is incredibly popular: near ubiquitous

https://www.docker.com/what-container#/package_software

This seems hard, can’t we simplify it?


Making inference on a single machine more efficient

• GPU or no GPU?

• Concurrency

• Model distillation

• Quantization

• Caching

• Batching

• Sharing the GPU

• Libraries

GPU or no GPU?

• GPU pros

- Same hardware you trained on, probably

- In the limit of model size, batch size tuning, etc., usually higher throughput

• GPU cons

- More complex to set up

- Often more expensive

Just because your model was trained on a GPU, it does not mean you need to serve it on a GPU

Concurrency

• What?

- Multiple copies of the model running on different CPUs or cores

• How?

- Be careful about thread tuning

https://blog.roblox.com/2020/05/scaled-bert-serve-1-billion-daily-requests-cpus/
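When running several model copies per machine, each copy should be pinned to a slice of the cores so they don't fight over them. A minimal sketch of the thread-tuning side in PyTorch; the counts are illustrative:

    import torch

    # Without limits, every copy of the model tries to use all cores and they thrash.
    # Call these once, before running any inference.
    torch.set_num_threads(4)          # intra-op parallelism per process
    torch.set_num_interop_threads(1)  # inter-op parallelism per process
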
Model distillation

• What?

- Train a smaller model to imitate your larger one

• How?

- Several techniques outlined below

- Can be finicky to do yourself — infrequently used in practice

- Exception — pretrained distilled models like DistilBERT

https://heartbeat.fritz.ai/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb
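Using a pretrained distilled model is usually just a checkpoint swap. A minimal sketch with Hugging Face transformers; per the DistilBERT write-up, the model is roughly 40% smaller and 60% faster than BERT-base at a small cost in accuracy:

    from transformers import pipeline

    # DistilBERT fine-tuned for sentiment analysis
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    print(classifier("Deploy early, deploy often!"))
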
Quantization

• What?

- Execute some or all of the operations in your model with a smaller numerical representation than floats (e.g., INT8)

- Some tradeoffs with accuracy

• How?

- PyTorch and TensorFlow Lite have quantization built-in

- Can also run quantization-aware training, which often results in higher accuracy

https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
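A minimal sketch of post-training dynamic quantization in PyTorch, which converts Linear layer weights to INT8; the model here is a hypothetical example:

    import torch

    # Hypothetical float32 model
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
    )
    model.eval()

    # Weights become INT8; activations are quantized dynamically at inference time
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # Same interface, smaller and usually faster on CPU
    print(quantized(torch.randn(1, 128)))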

Quantization: tools


Caching

• What?

- For some ML models, some inputs are more common than others

- Instead of always calling the model, first check the cache

• How?

- Can get very fancy

- Basic way uses functools (see the sketch below)
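
A minimal sketch of the functools approach; the predict function is a hypothetical stand-in, and note that lru_cache requires hashable inputs:

    from functools import lru_cache

    @lru_cache(maxsize=1024)
    def predict(text: str) -> float:
        # Hypothetical stand-in for expensive model inference;
        # repeated inputs are served from the in-memory cache
        return float(len(text) % 10) / 10

    predict("hello")  # computed
    predict("hello")  # served from cache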

Batching

• What?

- ML models often achieve higher throughput when doing prediction in parallel, especially on a GPU

• How?

- Gather predictions until you have a batch, run prediction, return to user

- Batch size needs to be tuned

- You need to have a way to shortcut the process if latency becomes too long

- Probably don't want to implement this yourself (see the toy sketch below for the idea)
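
To make the batch-or-timeout idea concrete, a toy sketch (not for production use; serving libraries implement this properly, and run_model here is a hypothetical batched inference call):

    import queue
    import threading
    import time

    requests_q: "queue.Queue[str]" = queue.Queue()

    def run_model(batch):
        # Hypothetical batched inference call
        print(f"predicting on batch of {len(batch)}")

    def batching_loop(max_batch: int = 8, max_wait_s: float = 0.05) -> None:
        while True:
            batch = [requests_q.get()]  # block until at least one request arrives
            deadline = time.monotonic() + max_wait_s
            # Gather until the batch is full or the latency budget is spent
            while len(batch) < max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(requests_q.get(timeout=remaining))
                except queue.Empty:
                    break
            run_model(batch)

    threading.Thread(target=batching_loop, daemon=True).start()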

Sharing the GPU

• What?

- Your model may not take up all of the GPU memory with your inference batch size. Why not run multiple models on the same GPU?

• How?

- You'll probably want to use a model serving solution that supports this out of the box

Model serving libraries (e.g., TensorFlow Serving, NVIDIA Triton, TorchServe)

Horizontal scaling

• What?

- If you have too much traffic for a single machine, split traffic among multiple machines

• How?

- Spin up multiple copies of your service and split traffic using a load balancer

- In practice, two common methods:

• Container orchestration (i.e., Kubernetes)

• Serverless (e.g., AWS Lambda)

Container orchestration


Frameworks for ML deployment on Kubernetes


Deploying code as serverless functions

• App code and dependencies are packaged into .zip files or Docker containers with a single entry point function

• AWS Lambda (or Google Cloud Functions, or Azure Functions) manages everything else: instant scaling to 10,000+ requests per second, load balancing, etc. Start here!

• Only pay for compute-time.
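
A minimal sketch of a Lambda-style handler for model inference; the model loader and input field are assumptions, and loading outside the handler reuses the model across warm invocations:

    import json

    # Loaded once per container, reused across warm invocations
    # (hypothetical loader; in practice e.g. joblib.load or torch.load)
    def load_model():
        return lambda text: float(len(text) % 10) / 10

    MODEL = load_model()

    def lambda_handler(event, context):
        # With API Gateway proxy integration, the request body arrives as a JSON string
        body = json.loads(event["body"])
        prediction = MODEL(body["text"])
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}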

Deploying code as serverless functions

• Cons:

- Limited size of deployment package

- Cold start

- Can be challenging to build pipelines of models

- Little to no state management (e.g., for caching)

- Limited deployment tooling

- CPU-only, limited execution time

How far away are we from serverless GPUs?


Model rollouts

• What?

- If serving is how you turn a model into something that can respond to requests, rollouts are how you manage and update these services

• How?

- You probably want to be able to roll out gradually, roll back instantly, split traffic between versions, and deploy pipelines of models

- This is a challenging infra problem, and beyond the scope of this lecture

- Your deployment library (or infra team) may take care of this for you


Managed options

• Cloud providers

• End-to-end ML platforms

• Startups

Double-click on Sagemaker


It's easy, but more expensive

• ~50-100% more expensive than raw EC2, depending on instance type

• Serverless is a better deal: ~20% more expensive than Lambda

Building a model service: takeaways

• If you are doing CPU inference, you can get away with scaling by launching more servers, or going serverless

• Serverless makes sense if you can get away with CPUs and traffic is spiky or low-volume

• Sagemaker is a perfectly good way to get started if you're on AWS, but can get expensive

• If you're using GPU inference, serving tools like TF Serving, Triton, and TorchServe will save you time

• Worth keeping an eye on the startups in this space for GPU inference

Step 3: move to the edge?



When should you consider edge deployment?

• Sometimes, it's obvious:

- No reliable internet connection

- Very strict data security / privacy requirements

• Otherwise:

- Accuracy and latency both affect end-user experience

- Latency includes the network roundtrip and model prediction time

- Once you've exhausted options for reducing model prediction time, consider edge

Edge prediction

[Diagram: the model runs on the client device itself; the server and database remain remote.]

Edge prediction

• Send model weights to the client device

• Client loads the model and interacts with it directly

Edge deployment

Pros:

• Lowest-latency

• Does not require an internet connection

• Data security — data doesn't need to leave the user's device

• Scale comes "for free"

Cons:

• Often limited hardware resources available

• Embedded and mobile frameworks are less full-featured than TensorFlow / PyTorch

• Difficult to update models

• Difficult to monitor and debug when things go wrong
Frameworks

Tools for edge deployment

TensorRT: model optimizer and inference runtime for TensorFlow + NVIDIA devices
CoreML

- Released at Apple WWDC 2017

- Inference only

- https://coreml.store/

ML Kit

- Announced at Google I/O 2018

- Either via API or on-device

- Offers pre-trained models, or can upload a TensorFlow Lite model

Tools for edge deployment

PyTorch Mobile: PyTorch on iOS and Android

Tools for edge deployment

TFLite: TensorFlow on mobile / edge devices
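
A minimal sketch of converting a trained TensorFlow model to the TFLite format; the saved-model path is a hypothetical placeholder:

    import tensorflow as tf

    # Convert a SavedModel to the TFLite flatbuffer format
    converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables e.g. quantization
    tflite_model = converter.convert()

    with open("model.tflite", "wb") as f:
        f.write(tflite_model)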

Tools for edge deployment

TensorFlow.js: TensorFlow in the browser

Tools for edge deployment

Apache TVM: library-agnostic and target-device-agnostic inference runtime

Watch this space

MLIR

Efficiency

More efficient models

• Quantization and distillation from above

• Mobile-friendly model architectures

MobileNets

[Figure: standard convolution blocks compared with MobileNet's building blocks, which use cheaper depthwise and 1x1 convolutions]

https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d

Recommended case study: DistilBERT

https://medium.com/huggingface/distilbert-8cf3380435b5

Mindsets for edge deployment

• Choose your architecture with your target hardware in mind

- You can make up a factor of 2-10 through distillation, quantization, and other tricks, but not more than that

• Once you have a model that works on your edge device, you can iterate locally as long as you add model size and latency to your metrics and avoid regressions

• Treat tuning the model for your device as an additional risk in the deployment cycle and test it accordingly

- E.g., always test your models on production hardware before deploying

• Since models can be finicky, it's a good idea to build fallback mechanisms into the application in case the model fails or is too slow

Edge deployment: conclusion

• Web deployment is easier, so use it unless you really need edge

• Choose your framework to match the available hardware and corresponding mobile frameworks, or try TVM to be more flexible

• Start considering hardware constraints at the beginning of the project and choose architectures accordingly

Only IRL do you see how your model actually works

• Deploy early, deploy often

• Keep it simple, and add complexity later

- Build a prototype

- Separate your model and UI

- Learn the tricks to scale

- Consider moving your model to the edge when you *really* need
to go fast ⚡
