An Architect's Guide to Site
Reliability Engineering
Nathaniel T. Schutta
@ntschutta
ntschutta.io
Software development practices evolve.
Feature not a bug.
It is the agile thing to do!
We’ve gone from devs and ops
separated by a large wall…
To DevOps all the things.
We’ve gone from monoliths to
service-oriented to microservices.
And it isn’t all puppies and rainbows.
Shoot.
A new role is emerging - the
site reliability engineer.
Why?
What does that mean to our teams?
What principles and
practices should we adopt?
How do we work together?
What is SRE?
Important to understand the history.
Not a new concept but a catchy name!
Arguably goes back to
the Apollo program.
Margaret Hamilton.
Crashed a simulator by inadvertently
running a prelaunch program.
That wiped out the navigation data.
Recalculating…
Hamilton wanted to add error-checking
code to the Apollo system that would
prevent this from messing up the systems.
But that seemed excessive to her higher-ups.
“Everyone said, ‘That would never happen,’” Hamilton remembers.
But it did. Right around Christmas 1968.
— ROBERT MCMILLAN
https://www.wired.com/2015/10/margaret-hamilton-nasa-apollo/
Luckily she did manage to
update the documentation.
Allowed them to recover the data.
Doubt that would have turned into a
Hollywood blockbuster…
Hope is not a strategy.
But it is what rebellions are built on.
Failures, uh, find a way.
Traditionally, systems
were run by sys admins.
AKA Prod Ops. Or something similar.
And that worked OK. For a while.
But look around your world today.
Service all the things!
Seems like everything is
aaS these days…
Infrastructure. Container. Platform.
Software. Function. Pizza.
Pretty sure about that last one…
Architecture as a Service…
Nothing new really.
CORBA anyone?
Facilitate communication for disparate
systems on diverse platforms…
Too soon?
EJB then.
Still have flashbacks?
Sorry. I’ve tried to forget most of it.
It didn’t stop there though did it?
Remember when SOA was
the new hotness?
Everything had to be all
SOA all the time!
There were books and talks and blogs
(remember those?), it was great!
How about API first?
Popularized in certain quarters.
Helpful?
How about this one?
Bet you use that one every day.
Maybe without knowing it.
One of these perhaps?
What caused this Cambrian
explosion of APIs?
Technology changes.
Proliferation of computers in
everyone’s pockets.
Commoditized hardware.
The Cloud!
Companies were altering
their approach too.
Little company called Amazon
made a major change.
Well, it had happened years earlier.
But a publicly shared rant detailed it.
https://plus.google.com/+RipRowan/posts/eVeouesvaVX
Steve Yegge - the Bezos mandate.
All data will be exposed through a
public service interface.
These interfaces are *the*
communication method between teams.
No other form of communication is
allowed. No direct reads, links, etc.
No back doors.
All services must be designed to
be public. No exceptions.
Don’t want to do this? You’re fired.
Unsurprisingly, things began to change.
And we learned some things.
Calls bounce between 20 services…
where was the problem?
Who do we page?
How do we monitor this
web of services?
How do I even *find*
these other services?
Debugging gets harder…
We’ve seen this story continue today.
Can’t swing a dry erase marker without
hitting someone talking about…
Microservices!
Bounded Context.
Domain-Driven Design.
Arguments over the
definition of microservice…
https://mobile.twitter.com/littleidea/status/500005289241108480
Rewrite it in two weeks.
Miniservice. Picoservice.
What do we even mean by
“application” today?!?
How about functions then?
Have we found a golden hammer yet?
Bet that would be helpful during
your next retro!
Turns out there are still engineering
issues we have to overcome.
It isn’t all puppies and rainbows.
Sorry.
Turns out, those things Yegge
mentions…are still things.
What would you say your
microservice call pattern looks like?
http://evolutionaryarchitecture.com
The traditional sys admin approach
doesn’t give us reliable services.
Inherent tension.
Conflicting incentives.
Developers want to
release early, release often.
Always Be Changing.
But sys admins want stability.
It works. No one touch anything.
Thus trench warfare.
Doesn’t have to be this way!
We can all get along.
What if we took a different
approach to operations?
“what happens when you ask a software
engineer to design an operations team.”
https://landing.google.com/sre/book/chapters/introduction.html
Ultimately, this is just software
engineering applied to operations.
Replace manual tasks with automation.
Focus on engineering.
Many SREs are software engineers.
Helps to understand UNIX
internals or the networking stack.
Our operational approach has to evolve.
The “Review Board” meeting
once a quarter won’t cut it.
How do we move fast safely?
Operations must be able to
support a dynamic environment.
That is the core of what we mean by
site reliability engineering.
How we create a stable, reliable
environment for our services.
It doesn’t happen in spare cycles.
Make sure your SREs have time to
do actual engineering work.
On call, tickets, manual tasks -
shouldn’t eat up 100% of their day.
SREs need to focus on automating
away “toil” aka manual work.
Isn’t this just DevOps?
Can argue it is a natural
extension of the concept.
Think of SRE as a specific
implementation of DevOps.
SRE Responsibilities.
What should my SRE team focus on?
Availability. Stability. Latency.
Performance. Monitoring.
Capacity planning.
Emergency response.
Drive automation.
SREs help us establish our SLOs.
Embrace risk. Manage risk.
Risk is a continuum.
And a business decision.
What do our customers expect?
What do our competitors provide?
Cost.
Should those sites/apps have
had a redundant backup?
https://twitter.com/KentBeck/status/596007846887628801
How much would that have
cost those sites/apps?
How much more revenue would
that have driven for them?
It is a tradeoff.
Long term vs. short term thinking.
Heroic efforts can work
for the short term.
But that isn’t sustainable.
In the long run it may be better to
lower your SLOs for a short time…
To allow you to engineer a
better long term solution.
Mean time to recovery.
Having a run book helps immensely.
…thinking through and recording the
best practices ahead of time in a
"playbook" produces roughly a 3x
improvement in MTTR as compared to
the strategy of "winging it."
— Benjamin Treynor Sloss
Site Reliability Engineering
We don't rise to the level of our
expectations, we fall to the level of our
training.
— Archilochus
https://mobile.twitter.com/walfieee/status/953848431184875520
Design/implement
monitoring solutions.
Establish alerting.
Logging best practices.
Create dashboards.
Four Golden Signals.
https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html#xref_monitoring_golden-signals
Latency - how long it takes to service a request.
Traffic - the level of demand on the system. Requests/second. I/O rate.
Errors - failed requests. Can be explicit, implicit, or a policy failure.
Saturation - how much of your most constrained resource is left.
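As a rough sketch, the four signals map onto a handful of metric types. This assumes the prometheus_client Python library and a hypothetical handle_request function - use whatever metrics stack you already run:
```python
# Minimal sketch of the four golden signals, assuming the
# prometheus_client Python library (swap in your own metrics stack).
# handle_request is a hypothetical handler.
import time
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter("requests_total", "Traffic: requests received")
ERRORS = Counter("request_errors_total", "Errors: failed requests")
LATENCY = Histogram("request_latency_seconds", "Latency: time to serve a request")
SATURATION = Gauge("worker_pool_utilization", "Saturation: fraction of workers busy")

def instrumented(handle_request, request):
    REQUESTS.inc()                                        # traffic
    start = time.perf_counter()
    try:
        return handle_request(request)
    except Exception:
        ERRORS.inc()                                      # errors
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)      # latency

def record_saturation(busy_workers, pool_size):
    SATURATION.set(busy_workers / pool_size)              # saturation
```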
Important to consider the
sampling frequency.
High resolution can be costly.
Aggregate data.
Some measures benefit from shorter
intervals…others not so much.
Establish alerting thresholds.
Alert levels can be hard to get right.
Should be actionable.
Should require a human.
Should result in a sense of urgency.
Implies we cannot have more than a
few pages a day - people burn out.
You can over alert and over monitor!
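One hedged sketch of "actionable": only page a human when the error rate stays over the threshold for a sustained window. The names and thresholds below are invented:
```python
# Sketch: page only when the error rate breaches the threshold for a
# sustained window, not on a single bad sample. All names and
# thresholds are illustrative.
from collections import deque

class SustainedErrorRateAlert:
    def __init__(self, threshold=0.01, window_size=5):
        self.threshold = threshold              # e.g. 1% of requests failing
        self.window = deque(maxlen=window_size) # last N samples

    def observe(self, errors, requests):
        rate = errors / requests if requests else 0.0
        self.window.append(rate)
        return self.should_page()

    def should_page(self):
        # A blip resolves itself; a trend needs a human.
        return (len(self.window) == self.window.maxlen
                and all(rate > self.threshold for rate in self.window))
```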
Eliminate toil.
Automate all the things.
Manual. Repetitive. Automatable.
One-offs. Reactive. Grunt work.
Toil drives people out of SRE work.
Hurts morale. Sets a bad precedent.
People can’t do the same thing the same way twice.
See golf.
We need consistency.
Deployment pipeline has
to be repeatable.
We cannot move fast safely unless
SREs are freed from toil.
Postmortems.
We will make mistakes.
Outages will still happen.
Vital we learn from those experiences.
Do not blamestorm.
“Blameless postmortems.”
Goal is to prevent it from
happening again.
Document the incident.
What happened?
What was the root cause (or causes)?
What can we do to prevent this
from happening in the future?
Be constructive, not sarcastic.
Consider a basic template.
Title/ID.
Authors.
Status.
Impact.
Root Causes.
Resolution.
Action Items.
Lessons Learned.
Timeline.
Whatever you think will help!
Can be difficult to create a
postmortem culture.
Consider a postmortem of the month.
Book club.
Wheel of Misfortune.
Role play a disaster you faced before.
Ease into it.
Recognize people for their participation.
Senior management needs to
encourage the behavior!
Perform retros on your postmortems!
Improve them!
We cannot learn anything without first
not knowing something.
— Mark Manson
The Subtle Art of Not Giving a F*ck
All services are equal. Some services
are more equal than others.
Defining our SLOs is a critical step
towards production hardened services.
Availability is one of our most
important objectives.
What is the availability goal
for this specific service?
SLAs vs. SLOs.
Terminology is often misused.
SLI - service level indicator.
A measure of some
aspect of our system.
Latency, error rate, throughput…
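In practice an SLI is often just a ratio of good events to total events. A toy sketch - the request data shape here is invented:
```python
# Toy SLIs over a list of (latency_ms, succeeded) tuples.
# The data shape is invented purely for illustration.
def availability_sli(requests):
    good = sum(1 for _, succeeded in requests if succeeded)
    return good / len(requests)

def latency_sli(requests, threshold_ms=100):
    fast = sum(1 for latency_ms, _ in requests if latency_ms <= threshold_ms)
    return fast / len(requests)

requests = [(42, True), (87, True), (350, True), (95, False)]
print(availability_sli(requests))   # 0.75
print(latency_sli(requests))        # 0.75, three of four under 100 ms
```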
Availability.
Percentage of time your
service is available.
How much downtime can you tolerate?
99%: 7.2 hours a month, 14.4 minutes a day.
99.9%: 8.76 hours PER YEAR, 1.44 minutes a day.
99.99%: 4.38 minutes A MONTH, 8.64 seconds a day.
99.999%: 6.05 seconds A WEEK, 864 milliseconds a day.
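Those numbers fall straight out of the arithmetic - a quick sketch:
```python
# Allowed downtime = (1 - availability) * period.
DAY = 24 * 60 * 60      # seconds

def allowed_downtime_seconds(availability, period_seconds):
    return (1 - availability) * period_seconds

print(allowed_downtime_seconds(0.999, 365 * DAY) / 3600)   # ~8.76 hours per year
print(allowed_downtime_seconds(0.9999, DAY))               # ~8.64 seconds per day
print(allowed_downtime_seconds(0.99999, 7 * DAY))          # ~6.05 seconds per week
```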
By the way, GKE’s availability
SLA is 99.5%. Just saying.
https://cloud.google.com/kubernetes-engine/sla
SLO - service level objective.
Our target value or a range of values.
Our SLO for availability is 99.9%.
Our average response time should be
less than 100 milliseconds.
Can be very tricky to pick an SLO!
More on that in a minute.
We don’t always get to choose our SLOs.
Our users may have a thing or two
to say about what they want!
We are also subject to the
things we depend on.
If you depend on a service that only
guarantees 99.9% availability…
You cannot guarantee 99.99%!
SLA - service level agreement.
Assign a consequence to
missing/meeting an SLO.
Often contractual and involve
some kind of financial penalty.
If there is no consequence,
we’re talking about an SLO.
SLAs are the realm of
product and legal decisions.
Many people say SLA
when they mean SLO.
Pedantic much?
Ubiquitous language and all that…
SLOs are not a purely technical decision.
Our business and our customers will
have an awful lot to say here.
Simple is good.
Infinity is not a goal.
More is not better.
SLOs should help you prioritize.
Give yourself some wiggle room -
you can always tighten them up later.
Ensure there is some margin for error.
May have an “internal” SLO that is
tighter than your advertised SLO.
Overachieving doesn’t help.
People will start to rely on
your overachieving!
Introduce some latency or
downtime to stay at your SLO.
Everyone wants 99.999%.
Everyone wants hot/hot.
Until they see the price tag.
If you have to ask…
Establish an error budget.
Establish your target availability.
Say 99%.
Your error budget is 1.0%.
Spend it however you want! Just
don’t go over that limit.
Goal is not zero outages. Goal is to
stay within the error budget.
Go ahead with those experiments.
Try different things.
Once we use up the budget though…
Have to slow our roll.
Aligns incentives.
Helps everyone understand
the cost/benefit tradeoff.
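A hedged sketch of the bookkeeping for a request-based budget - all numbers invented:
```python
# Sketch of a request-based error budget: with a 99.9% SLO, 0.1% of
# this period's requests are the budget. Numbers are invented.
def error_budget_report(slo, total_requests, failed_requests):
    budget = (1 - slo) * total_requests     # requests we are allowed to fail
    remaining = budget - failed_requests
    return {
        "budget_requests": round(budget),
        "failed_requests": failed_requests,
        "remaining": round(remaining),
        "keep_shipping": remaining > 0,     # budget left: experiments welcome
    }

print(error_budget_report(0.999, total_requests=1_000_000, failed_requests=600))
# {'budget_requests': 1000, 'failed_requests': 600, 'remaining': 400, 'keep_shipping': True}
```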
Working with SREs.
Production Readiness Reviews.
Not a one time, up front thing.
Services should be reviewed
and audited regularly.
Does not have to be high ceremony!
Get the team together - SREs, Devs,
etc. Draw up the architecture.
Do we have a shared understanding
of what the system does?
Do we have a shared understanding
of our requirements?
As we talk through it, we
will discover bottlenecks.
The Wombat service has a lower
availability level than we need.
We will find interesting failure cases.
“When month end falls on the
Super Blue Blood Moon.”
Review should result in a new
architecture diagram or two.
And probably some new
items on the backlog.
When we refer to an application or
microservice as “production-ready,” we
confer a great deal of trust upon it: we
trust it to behave reasonably, we trust it
to perform reliably, we trust it to get the
job done…
— Susan J. Fowler
Production-Ready Microservices
How do we know we
can trust a service?
Consider having a checklist.
A checklist? Seriously?
http://atulgawande.com/book/the-checklist-manifesto/
You know who uses checklists?
Pilots. Surgeons.
Should be quantifiable and measurable.
“Fast” won’t cut it.
Stability.
Reliability.
Scalability.
Fault tolerance.
Performance.
Monitoring.
Documentation.
I know what some of you are thinking…
I don’t have time for all this.
We need to MOVE FAST.
And break things…
We’re Agile. With a capital A.
How is your velocity when an outage
brings your business to a halt?
Requires buy-in at the grassroots
level as well as from management.
Perform an audit.
Go back to your checklist. Does the
service meet our requirements?
Probably results in some
new things in our backlog!
Now we can create a
production readiness roadmap.
What do we need to fix, and
when can/should we fix it?
Drive prioritization of the work.
A lot of this is manual. But
some of it can be automated!
http://evolutionaryarchitecture.com
Fitness functions!
Basically, a set of tests we execute
to validate our architecture.
How close does this particular
design get us to our objectives?
Ideally, all automated. But we may
need some manual verifications.
For example…
All service calls must
respond within 100 ms.
Cyclomatic complexity
shall not exceed X.
Hard failure of an application
will spin up a new instance.
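For example, the 100 ms rule could become an automated test in the pipeline. A rough sketch using only the Python standard library - the URL and threshold are placeholders:
```python
# Sketch of a latency fitness function: fail the build when the health
# endpoint misses its budget. The URL and threshold are placeholders.
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/health"    # placeholder
LATENCY_BUDGET_SECONDS = 0.100                  # "respond within 100 ms"

def test_health_endpoint_meets_latency_budget():
    start = time.perf_counter()
    with urllib.request.urlopen(SERVICE_URL, timeout=1) as response:
        assert response.status == 200
    elapsed = time.perf_counter() - start
    assert elapsed < LATENCY_BUDGET_SECONDS, f"{elapsed:.3f}s blows the budget"
```
Run it from the deployment pipeline (pytest or similar) and the rule gets checked on every change.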
Reviews and audits should
not be additional red tape.
Should not be overly bureaucratic.
A couple of hours…depending on
the complexity of the system.
Architectural reviews.
Look for failure points.
Draw up the architecture.
What happens if *this* fails?
It can’t fail? Yeah it can -
what happens if it does?
Think through how our
service could fail.
It is hard. We are really good at
thinking through the happy path.
But we need to think about
the road less traveled.
Test for these failure scenarios. Does
our service respond appropriately?
Only one way to really know…
https://github.com/Netflix/SimianArmy
Chaos engineering.
http://principlesofchaos.org
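A tiny taste of the idea, as a sketch: inject failures on purpose and verify the fallback path holds up. Everything here is a stand-in for a real experiment:
```python
# Sketch: inject failures into a dependency and verify the service
# degrades gracefully instead of falling over. All names are stand-ins.
import random

def flaky(call, failure_rate=0.3):
    """Wrap a dependency call so it fails some fraction of the time."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("chaos: injected failure")
        return call(*args, **kwargs)
    return wrapped

def get_recommendations(fetch):
    try:
        return fetch()
    except ConnectionError:
        return ["default", "fallback", "items"]   # degrade, don't die

fetch = flaky(lambda: ["personalized", "items"])
for _ in range(50):
    assert len(get_recommendations(fetch)) > 0    # never empty, never crashes
```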
Next steps.
Do you have an SRE team?
Do you wish you had an SRE team?
It can be built!
You probably have some engineers
that can (and would!) do it.
Our applications are changing radically.
Dev + Ops.
We all need to evolve to succeed.
We can’t move fast safely unless the
environment enables it.
We have to work together.
Want more?
https://landing.google.com/sre/book.html
Questions?
Thanks!
I’m a Software Architect, Now What? with Nate Schutta
Presentation Patterns with Neal Ford & Nate Schutta
Modeling for Software Architects with Nate Schutta
Nathaniel T. Schutta
@ntschutta
ntschutta.io