SRE Essentials: Key Principles & Practices

The document discusses site reliability engineering (SRE) and how it fits into existing operations models like DevOps and ITIL. SRE aims to create reliable systems that satisfy customer expectations. It is based on a customer-first mentality and key tenets like ensuring engineering focus, pursuing change velocity without violating reliability standards, and monitoring, emergency response, and capacity planning. SRE complements DevOps by codifying practices to achieve DevOps goals and works with ITIL by helping different functions collaborate to improve customer happiness through reliability. The seven principles of the newest ITIL revision also align closely with SRE practices like optimizing workflows through automation and progressing iteratively with feedback.

Uploaded by

Alvaro Rodrigues

The Essentials Guide to SRE
Key principles and practices for production teams

Table of Contents

Why Site Reliability Engineering

What is SRE?

Understanding How SRE Fits Into Your Operations Model

How SRE works with DevOps

How SRE works with ITIL

The seven principles of ITIL

Principle No. 1: Create a mindset of resiliency

Incident playbooks

Change management

Capacity planning

Postmortems best practices

Principle No. 2: Reduce engineering problems/innovation blockers

Principle No. 3: Approach systems from a human perspective

On-call & full service ownership practices

Keeping burnout at bay

Celebrating failure

Begin your SRE journey today

Why Site Reliability Engineering
In the world of technology, the stakes have never been
higher. The move to the cloud and microservices to maximize
agility has given way to digital disruptors and unprecedented
competitive threats. As distributed systems become
increasingly complex, the scale of ‘unknown unknowns’
increases. On top of this, customer expectations are sky-high.
The cost of downtime is catastrophic, with customers willing to
churn if their needs are not promptly met. According to Gartner,
the average cost of downtime is $300,000 per hour. For some
companies, this number is considerably higher; for example,
Amazon lost approximately $90 million during their Prime Day
outage in 2018, and the outage only lasted 75 minutes.

Organizations need to prioritize reliability so they can innovate
as quickly as possible on top of a strong foundation that won’t
compromise customer experience. This will become even more
critical as more businesses move toward distributed systems
with high reliability requirements. That’s where site reliability
engineering (SRE) comes in. The SRE function is growing
quickly (30-70% YoY growth in job listings), but there is not
enough skilled talent in the market to meet demand. In other
words, it is important to understand how you can not just
hire SREs, but also grow your existing organization to adopt the
practices and mindsets required for production excellence.
With SREs in short supply, what can you do to ensure
your service’s reliability? To answer this question, you’ll need a
deeper understanding of what SRE actually is.
What is SRE?
SRE is a practice first coined by Google in 2003 that seeks to
create systems and services that are reliable enough to satisfy
customer expectations. Since then, many large organizations,
such as LinkedIn and Netflix, have adopted SRE best practices.
In recent years, SRE has been adopted more widely by
organizations around the world that are pursuing reliability and
resilience in the face of exponentially growing customer
expectations and systems complexity.

SRE is based on a customer-first mentality. This means that SRE efforts are all tied to customer
satisfaction, even if the customers using the service are actually internal users. Each decision
should result in an increase in customer satisfaction. Teams work together to determine which
factors and experiences affect customer happiness, measure them, set goals, and balance
reliability requirements with the innovation velocity required to stay viable in an increasingly
competitive digital landscape.

To achieve this lofty goal, SREs and teams that have adopted SRE best practices refer to several
key tenets of SRE. According to Google, these include:

• Ensuring a durable focus on engineering

• Pursuing maximum change velocity without violating a service level
objective (SLO)

• Monitoring, including alerts, ticketing, and logging

• Emergency response

• Change management

• Demand forecasting and capacity planning

• Provisioning, and

• Efficiency and performance

4 Blameless The Essentials Guide to SRE

According to Forrester, 46% of the tenets can be applied out-of-the-box for most software
teams in the enterprise, but the rest require customizations or won’t make sense for the vast
majority of organizations. The important question to ask yourself is how these tenets fit in with
what you’re already doing, and how your teams can improve. We’ve got more answers below.


Understanding How SRE Fits Into Your Operations Model
A common early mistake in adopting SRE best practices is
assuming that you’ll need to rip and replace your current
procedures, which simply isn’t true. In fact, SRE can work as a
complement to both DevOps and ITIL methodologies. The trick is to
ensure that regardless of your organization’s different operating
models or toolchains, there is shared visibility, communication,
and collaboration across teams. This will allow your disparate
teams to stay aligned while using the best practices from each
methodology.

How SRE works with DevOps

Think of SRE as the practice that brings life to the DevOps philosophy. The core principles of
DevOps and SRE are nearly identical. According to Google’s Coursera course on SRE, “class
SRE implements DevOps,” the 5 DevOps principles are as follows:

1. Reduce organizational silos: SRE helps by sharing ownership across
developers and production teams, and unifying tooling.

2. Accept failure as normal: Blameless postmortems are an SRE best practice that
ensures that all incidents are used as learning opportunities. SRE also creates a
safe space and guardrails for failure through SLOs and error budgets.

3. Implement gradual change: This is done by canarying rollouts to a small subset
of customers before allowing all users to interact with new features. Smaller
changes are easier and safer to dissect and iterate on.

4. Leverage tooling and automation: SREs work to eliminate toil by measuring
it and creating automation to do repetitive tasks without needing human
intervention. This way, humans can focus on higher-value work.

5. Measure everything: SRE specifically focuses on measuring toil and reliability
to make sure that both customers and software teams are happy with the
service.
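
To make the gradual-change principle above more concrete, here is a minimal Python sketch of one common way to pick a stable canary cohort: hash each user ID into a bucket and admit only the buckets below the rollout percentage. The function name `in_canary`, the feature label, and the 10,000-bucket granularity are illustrative assumptions, not something the guide or Google prescribes.

```python
import hashlib


def in_canary(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Deterministically decide whether a user is in the canary cohort.

    Hypothetical helper: the same user always lands in the same bucket for
    a given feature, so raising rollout_percent only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < rollout_percent * 100


# Roll a hypothetical feature out to roughly 5% of users first.
users = [f"user-{i}" for i in range(10_000)]
canary_users = [u for u in users if in_canary(u, "new-checkout", 5.0)]
print(f"{len(canary_users)} of {len(users)} users see the new feature")
```

Because the bucket depends only on the feature name and user ID, a user who is in the cohort at 5% stays in it when the rollout widens to 50%, which keeps the experience consistent while the change is being observed.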

With these common principles defined, it’s easy to see how SRE and DevOps fit really well
together, with SRE codifying practices that make it easier to achieve the promises of DevOps. But
how would SRE work alongside ITIL?

How SRE works with ITIL

In practice, ITIL and SRE can also make for a great combination. The first reason why is simple:
every organization wants happy customers, and ITIL and SRE can help different functions work
together to make that a reality. Embedding reliability throughout the software lifecycle can
ensure a higher rate of customer happiness. With the newest revision of ITIL, which introduces
seven guiding principles, SRE and ITIL align even more closely.

The seven principles of ITIL 4

1. Start Where You Are: Adopting SRE best practices is not one-size-fits-all,
and everyone starts somewhere. Taking the first steps and implementing and
iterating as you go is what matters most.

2. Keep it Simple and Practical: In the Google SRE book’s chapter on simplicity, it
states “Unlike just about everything else in life, ‘boring’ is actually a positive
attribute when it comes to software! We don’t want our programs to be
spontaneous and interesting; we want them to stick to the script and predictably
accomplish their business goals.” Simplicity in both software and business
operations streamlines communication, increases velocity, and helps ensure that
reliability isn’t compromised. Less is more.

3. Optimize and Automate: One of the goals of SRE is to automate toil-heavy
processes and free up developer time to focus on innovation instead of unplanned
work. This optimizes workflows and allows new features to ship faster.

4. Progress Iteratively with Feedback: SREs set alerts for the most important and
user-centric metrics. The metrics, alerts, and SLOs they’re tied to are all iterated
upon to better satisfy customer needs.

5. Collaborate and Promote Visibility: SRE is culturally collaborative. It focuses on a
blameless work culture that values learning from failure, and trusting that each team
member is doing what he or she thinks is best for the organization.

6. Focus on Value: Without customers, there is no value in software. Business value
is created when customers want, and get, what they need from a product. SRE
best practices ensure that the product is reliable enough to provide value to the
customers, and also protect the most important customer journeys. Thus, they
provide significant value to the organization in helping to drive shared focus.

7. Think and Work Holistically: By breaking down silos and focusing on scalability and
reliability on a holistic level, SREs are able to provide significant benefits in maturing
the organization. Business-wide success is in the hands of every team member, and
SREs work to make sure that the company’s product, systems, and procedures are
resilient enough to not just meet but exceed customer standards.

For a visual on how SRE, DevOps, and ITIL’s best practices can be used in conjunction with each
other, we created a handy graph.

| | ITIL | DevOps | SRE |
|---|---|---|---|
| Philosophy & Culture | Align IT with business needs to create a symbiotic relationship. Command-and-control and process-driven to mitigate risk. | Improve teamwork and eliminate silos. Aims to create alignment and minimize silos between development and operations. Often oriented toward helping teams improve velocity and quality of deploys. | Eliminate toil, design for operability. Treats operations as a software problem to maximize efficiency. Ideal to support distributed services at scale that need to be hyper-reliable. |
| Key Practices & Tooling | Capacity planning; service catalog / CMDB; problem management; change management / advisory board | Capacity planning; on-call; microservices; CI/CD; infra as code; monitoring and logging; comms & collaboration | Same as DevOps key practices, as well as: progressive rollouts; SLOs & error budgets; observability; chaos engineering |
| Teamwork | Traditional model of centralized process and visibility. Work is typically queued (‘waterfall’). Incidents routed through central NOC team. | Dev and ops increasingly share the same process and tooling throughout the entire service lifecycle. Typically, this means devs go on-call for what they build, but may engage ops for L2 support. | SREs often act as consultants to establish reliability-oriented practices. Software Eng and SREs’ roles converge, aligning around shared process and outcomes. |
| Key Measures | Availability, # incidents, # escalations, etc. | Availability, deployment frequency, etc. | SLOs as well as availability, deployment frequency, etc.; error budgets |

Whether you identify as a DevOps or ITIL shop, your organization has something to gain by
following the principles of SRE. Let’s dive into what exactly these principles entail.

Principle No. 1: Create a mindset of resiliency
Resiliency isn’t something that just happens; it’s a result
of dedication and hard work. To reach your optimal state of
resilience, there are some crucial SRE best practices you should
adopt to strengthen your processes.

Incident playbooks
As you know, failure is not an option… because actually, it’s inevitable. Things will go wrong,
especially with growing systems complexity and reliance on third-party service providers. You’ll
need to be prepared to make the right decisions fast. There’s nothing worse than being called in
the wee hours of a Sunday morning to handle a situation where thousands of dollars are going
down the drain every second. Your brain is foggy, and you’ll likely need time to adjust to the
extreme pressure of a critical incident. In these cases (and really, all cases where an incident is
involved), incident playbooks can help guide you through the process and maximize the use of time.

According to Chris Taylor at Taksati Consulting, good incident playbooks help you cover all your
bases. They typically include flowcharts and checklists to depict both the big picture and the
minute details, a RACI (responsible, accountable, consulted, informed) chart for each step, and a list
of environmental influences that are unique to your system. To create your incident playbook, Chris
recommends aggregating the following information:

• An inventory of relevant tools

• The right personnel/subject matter experts to engage in response

• The problem to solve, or the workflow you’re trying to document

• Current state (whether this is a new process, or updating an old one)

By developing incident playbooks and practicing running through them, you’ll be
more prepared for the inevitable.

Change management
Change management is often done haphazardly, if at all. This means that organizations are unable
to manage the risk of pushing new code, possibly leading to more incidents. Rather than employ
ITIL’s arduous CAB method, SRE seeks to empower teams to push code according to their own
schedule while still managing risk. To do this, SRE uses SLOs and error budgets.

SLOs, or service level objectives, are internal goals for service availability and speed which are set
according to customer needs. These SLOs serve as a benchmark for safety. Each month, you have
a certain allowable amount of downtime determined by your SLO; this allowance is your error
budget. You can spend it on pushing new features. If a feature is at risk of exceeding your error
budget, it cannot be pushed until the next window. If the feature poses little to no risk to your SLO,
then you can push it. Each month, teams should aspire to use the entirety of, but not exceed, their
error budgets. This way, your organization can optimize for innovation, but do so safely without
risking unacceptable levels of customer impact.
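
As a rough illustration of the arithmetic, the sketch below turns an availability SLO into a monthly downtime allowance and checks whether a change fits in what is left. It assumes a 30-day window and an availability-only SLO; the function names and the idea of estimating a change's worst-case downtime in minutes are hypothetical simplifications, not a prescribed policy.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def can_ship(slo_target: float, downtime_so_far_min: float,
             estimated_risk_min: float, window_days: int = 30) -> bool:
    """True if a change's estimated worst-case downtime fits the remaining budget."""
    remaining = error_budget_minutes(slo_target, window_days) - downtime_so_far_min
    return estimated_risk_min <= remaining


budget = error_budget_minutes(0.999)
print(f"Monthly error budget at 99.9%: {budget:.1f} minutes")  # 43.2 minutes

# 30 minutes of the budget are already spent this month.
print(can_ship(0.999, downtime_so_far_min=30.0, estimated_risk_min=10.0))  # True
print(can_ship(0.999, downtime_so_far_min=30.0, estimated_risk_min=20.0))  # False
```

At a 99.9% target, a 30-day month allows about 43 minutes of downtime, so a team that has already spent 30 of them has room for a low-risk change but not for one that could cost 20 minutes.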

Capacity planning
Black Friday outages, scaling, moving to cloud: all of these big events require heightened
capacity planning. If you don’t have enough load balancers on Black Friday or Cyber Monday, you
might be sunk. Or, if your company is simply growing quickly, you’ll need to adopt best practices to
make sure that your team has everything it needs to be successful. There are two types of demand
that require additional capacity: the first is organic demand (your organization’s natural
growth), and the second is inorganic demand (the growth that happens due to a marketing campaign
or holiday). To prepare for these events, you’ll need to forecast the demand and plan time for acquisition.

Important facets of capacity planning include regular load testing and accurate provisioning.
Regular load testing allows you to see how your system is operating under the average strain
of daily users. As Google SRE Stephen Thorne writes, “It’s important to know that when you reach
boundary conditions (such as CPU starvation or memory limits) things can go catastrophic, so
sometimes it’s important to know where those limits are.” If your service is struggling to load
balance, or the CPU usage is through the roof, you know that you’ll need to add capacity in the
event of increased demand. That’s where provisioning comes in.

Adding capacity in any form can be expensive, so knowing where you need additional resources
is key. It’s important to routinely plan for inorganic demand so you have time to provision correctly.
The process of adding capacity can sometimes be a lengthy effort, especially if you are
moving to cloud. You’ll also need to know how many hands you’ll need on deck for these
momentous occasions.
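
The forecasting step can be sketched in a few lines of Python. This is a simplified model under stated assumptions: organic demand compounds at a fixed monthly rate, inorganic demand is a single multiplier for the event, and each instance can safely run at a chosen fraction of its load-tested limit. All names and numbers here are hypothetical.

```python
import math


def forecast_peak_rps(current_peak_rps: float, monthly_growth: float,
                      months_ahead: int, event_multiplier: float = 1.0) -> float:
    """Organic growth compounded monthly, then scaled for an inorganic event."""
    organic = current_peak_rps * (1.0 + monthly_growth) ** months_ahead
    return organic * event_multiplier


def instances_needed(peak_rps: float, rps_per_instance: float,
                     target_utilization: float = 0.6) -> int:
    """Instances required so each one stays under its target utilization."""
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps)


# Assumed inputs: 1,000 requests/s today, 5% monthly growth, a holiday
# six months out that triples traffic, and a load-tested limit of 200
# requests/s per instance.
peak = forecast_peak_rps(1000.0, monthly_growth=0.05, months_ahead=6,
                         event_multiplier=3.0)
print(f"Forecast peak: {peak:.0f} requests/s")
print(f"Instances to provision: {instances_needed(peak, rps_per_instance=200.0)}")
```

The per-instance limit is the number that regular load testing should supply, and the utilization target is the headroom that keeps the service away from the boundary conditions described above.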

Resiliency doesn’t just exist in your processes, it also exists in your people.

Capacity planning is an important part of having a resilient system because in thinking about the
allocation of resources, your team members matter. They need time off for holidays, personal
vacations, and the obligatory annual cold. When you fail to plan for time off, you won’t have enough
hands on deck to handle incidents as they occur. Denying people time off is obviously not the
answer, as that will only lead to burnout and churn. So it’s important to develop a capacity plan that
can accommodate people being, well, people.

Johann Strasser shares four steps you can take to develop a capacity plan that will eliminate
staffing insecurity:

1. Establish all necessary processes with the appropriate staff – from top
management to team leaders. Decide how often you will need to revise/revisit
this process and make sure that everyone is in agreement on this.

2. Provide for complete and up-to-date project data and prioritize your projects.
What projects are the most important, and which can be put on the back burner
for now? Additionally, how long will each project take? You’ll need this data to
be able to move forward with accurate plans.

3. Identify the capacities across your existing team, as well as your infrastructure
and services. Is the team equipped and system architected in a way that
minimizes performance regressions, to protect efficiency and capacity?

4. Consolidate the requirements (step 2) and the capacities (step 3). Identify
underload as well as overload and try to balance them.
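
Step 4 above amounts to comparing two numbers per team. The following Python sketch shows one way to do that comparison; the `TeamPlan` structure, the hour figures, and the 10% tolerance band are assumptions made for illustration and do not come from Johann Strasser's method.

```python
from dataclasses import dataclass


@dataclass
class TeamPlan:
    name: str
    required_hours: float   # step 2: sum of prioritized project estimates
    available_hours: float  # step 3: staffed hours minus planned time off


def balance(plans: list[TeamPlan], tolerance: float = 0.10) -> dict[str, str]:
    """Label each team as overload, underload, or balanced (step 4)."""
    result = {}
    for plan in plans:
        ratio = plan.required_hours / plan.available_hours
        if ratio > 1.0 + tolerance:
            result[plan.name] = "overload"
        elif ratio < 1.0 - tolerance:
            result[plan.name] = "underload"
        else:
            result[plan.name] = "balanced"
    return result


# Hypothetical quarter: hours already account for holidays and vacations.
plans = [
    TeamPlan("payments", required_hours=520, available_hours=400),
    TeamPlan("search", required_hours=300, available_hours=480),
    TeamPlan("platform", required_hours=410, available_hours=400),
]
for team, status in balance(plans).items():
    print(f"{team}: {status}")
```

In this made-up example, the payments team is overloaded and the search team has slack, which suggests moving work or people between them before the quarter starts rather than during an incident.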

So, now you’ve got the people and the process, but how can you learn and improve on
your resilience? For that, you’ll need great postmortem practices in place that facilitate real
introspection, psychological safety, and forward-looking accountability.

Postmortems best practices
When something goes wrong, it’s important to learn from it to prevent the same mistake from
happening again. To do this, it’s important to craft and analyze postmortems (or post-incident
reviews, RCA reports, or whatever you like to call them). To have postmortems worthy of analysis,
applying SRE best practices will be key. In fact, postmortems are a great place to begin your SRE
adoption journey.

As Steve McGhee, SRE Leader at Google, shares, “Conducting blameless postmortems will
enable you to see gaps in your current monitoring as well as operational processes. Armed with
better monitoring, you will find it easier and faster to detect, triage, and resolve incidents. More
effective incident resolution will then free up time and mental bandwidth for more in-depth
learning during postmortems, leading to even better monitoring.

In other words, building a postmortem practice will eventually enable you to identify and
tackle classes of issues, including fixing deeply rooted technical debt. With time, you’ll be able
to practice SRE, directly improving the systems continuously.”

One of the most important elements of a postmortem, and of SRE as a whole, is the notion of
blamelessness. To learn from postmortems, there needs to be total transparency. Opening up
about mistakes can often be frightening, and requires a psychologically safe space to do so.
Positive intent should always be assumed in order to foster the trust that allows for true openness.
Blaming team members or defining people as the root cause for failure will only lead to more
insecurity, covering up the important truths that postmortems are meant to uncover.

To craft great postmortems, there are four other best practices that will ensure your incidents are
being used to their full advantage:

• Use visuals in your postmortems: As Steve McGhee says, “A ‘what
happened’ narrative with graphs is the best textbook-let for teaching other
engineers how to get better at progressing through future incidents.” Graphs
provide an engineer with a quickly readable yet in-depth explanation for what
was happening during the incident days, weeks, or even years later.

• Be a historian: Timelines can be invaluable for parsing through a particularly
dense incident. Chat logs can be cluttered, and it’s difficult to quickly find
what you’re looking for. Thus, it’s important to have a centralized timeline that
gives a clean, clear summary of the events. This also provides the context that
helps relevant team members analyze what happened.

• Tell a story: An incident is a story. To tell a story well, many components must
work together. Without sufficient background knowledge, this story loses
depth and context. Without a timeline dictating what happened during an
incident, the story loses its plot. Without a plan to rectify outstanding action
items, the story loses a resolution.

• Publish promptly: Promptness has two main benefits: first, it allows the
authors of the postmortem to report on the incident with a clear mind, and
second, it soothes affected customers. Best-in-class companies like Google,
Uber, and others have internal SLOs around publishing their postmortems
within 48 hours.

Creating incident playbooks, utilizing change management and capacity planning, and following
postmortem best practices will all contribute to your system’s resilience, but that’s not all that SRE
seeks to do.

Principle No. 2: Reduce engineering problems/innovation blockers
Happy engineers mean happy customers, as engineers won’t
build the best products possible without support from the
organization. There are two major ways that SRE can help
brighten engineering’s day.

1. Toil: One of the main focuses of SRE is automation. Toil is a waste of precious
engineering time, and when SREs create frameworks, processes, and internal
tooling to eliminate it, engineers can get back to innovating.

2. Elimination of tech debt: SREs create accountability around postmortem
follow-up action items to make sure that old issues aren’t buried under new
code. SREs also put together frameworks to help developers deliver more
performant code, prioritizing what matters most to the customer experience.
Pinpointing the tech debt build-up that hurts customer experience is important
to guide refactoring initiatives and other practices to reduce tech debt. This
establishes a baseline for healthy engineering practices to help minimize future
accrual of tech debt.

Additionally, SREs invest in cultural change that prevents more tech debt from accruing in the
future, while still making way for innovation. Jean Hsu wrote about her experience refactoring
Medium’s codebase, and realized that the most important thing she could do for her team wasn’t
just to fix spaghetti code; it was to create a culture that fixes technical debt as it goes along,
deleting old code as needed. Jean wrote, “I realized that if I always did this type of work myself, I
would be constantly refactoring, and the rest of the team would take away the lesson that I'd clean
up after them. Though I did enjoy it myself, I really wanted to foster a long-term culture where
engineers felt pride and ownership over this type of work.”

SREs are often the cultural drivers for this sort of work, improving the way engineering teams
function as a whole rather than simply going from project to project fixing bugs. These changes are
long-term initiatives that spark growth and adoption of best practices for the entire organization.

As you can see, SRE can positively impact each engineer’s day-to-day productivity. In
fact, SRE is not about tooling or job titles, but is rather a more human-centric approach to
systems as a whole.

With this context in mind, adoption brings positive business benefits for everyone in the organization.

Principle No. 3: Approach systems from a human perspective
Resiliency engineering as a practice looks at systems holistically,
considering not only infrastructure but also human, process,
and cultural factors. Without adopting the culture and mindset
behind SRE, you’ll simply have new processes with no uniting
value at the center to keep the initiative in place. Focusing on
the human approach to systems requires reevaluating your
organization’s attitude towards three crucial things.

On-call & full service ownership practices

The notion of on-call is important in SRE for several reasons. It establishes clear ownership to
ensure software problems are immediately addressed, and inherently incentivizes developers to
ship more performant code. But while going on-call is now a fairly common practice, establishing
a healthy, balanced process is crucial to prevent burnout. Nobody can be on-call 24/7, especially
when incidents during the on-call period actively disrupt engineers' personal lives. People
need uninterrupted time away from work to be at their best, so on-call responsibilities need to
be carefully monitored. If someone is waking up at 2 AM every night for a full month, there’s
something wrong; it’s simply unsustainable. Additionally, no single person should have
to carry the burden alone. The whole development team should be empowered to be on-call so the
responsibility becomes a shared one. This also incentivizes developers to ship better code to avoid
getting woken up at 2 AM.

SRE best practices encourage a better proactive system, with a robust reactive system in place.
Being proactive means fostering a community of constant learning and improvement. When
your engineers are better prepared and learning from previous incidents, it’s less likely that the
same mistakes will be made again. This lowers the number of incidents occurring as your SRE
practice matures. From a reactive perspective, better incident management practices can allow
for streamlined communication during an incident, and provide a foundation to treat incidents as
‘unplanned investments’ as they become important learning opportunities. Postmortems thus give
engineers a place to begin looking when the root cause of an incident is evading them. SRE gives
those who hold the pager more agency.

Keeping burnout at bay

Constant firefighting, especially with a tough on-call schedule, can leave engineers feeling burnt
out. Over time, burnout leads to high turnover rates, meaning senior engineers will need to
pick up additional slack while new hires are ramped up. This only increases burnout, leading to a
vicious cycle of dissatisfied engineers who have little capacity to think about improvements, and
new hires who are clueless about where to begin.

In this situation, the SRE approach would encourage improved visibility into engineering hours,
on-call periods, and repeat incidents. Each of these issues directly contributes to burnout, yet many
organizations aren’t tracking them. By knowing which engineers have spent abnormally high hours
over an extended period of time, team leads can suggest vacation time to curb burnout. Knowing
who has been on-call every weekend for the last month allows teams to better manage the rotation
so everyone gets a break. Monitoring repeat incidents and incidents of a similar class can give
insight into what’s burning through engineering hours, as well as whether previous postmortems
uncovered improvements or follow-up items that were never acted on. These are issues
that should be fixed promptly in order to give teams a break from firefighting, and more time for
strategic work.
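
Tracking of this kind does not need heavy tooling to get started. The Python sketch below counts weekend on-call shifts per engineer and flags anyone above a threshold; the shift-log format, the names, and the limit of two weekend shifts per month are assumptions for illustration only.

```python
from collections import Counter
from datetime import date


def weekend_shift_counts(shifts: list[tuple[str, date]]) -> Counter:
    """Count how many on-call shifts each engineer worked on a weekend."""
    counts: Counter = Counter()
    for engineer, day in shifts:
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            counts[engineer] += 1
    return counts


def needs_rotation_relief(shifts: list[tuple[str, date]],
                          max_weekend_shifts: int = 2) -> list[str]:
    """Engineers whose weekend on-call load exceeds the agreed limit."""
    counts = weekend_shift_counts(shifts)
    return sorted(name for name, n in counts.items() if n > max_weekend_shifts)


# Hypothetical on-call log for one month: (engineer, shift date).
june_shifts = [
    ("ana", date(2024, 6, 1)),   # Saturday
    ("ben", date(2024, 6, 2)),   # Sunday
    ("ben", date(2024, 6, 3)),   # Monday, not counted
    ("ana", date(2024, 6, 8)),   # Saturday
    ("cy", date(2024, 6, 9)),    # Sunday
    ("ana", date(2024, 6, 15)),  # Saturday
]
print(needs_rotation_relief(june_shifts))  # ['ana']
```

The same counting approach extends to hours worked or to repeat incidents grouped by class, which gives team leads the visibility described above without waiting for a dedicated dashboard.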

Celebrating failure
Failure will happen, incidents will occur, and SLOs will be breached. These things may be difficult
to face, but part of adopting SRE is to acknowledge that they are the norm. Systems are made
by humans, and humans are imperfect. What’s important is learning from these failures and
celebrating the opportunity to grow.

One way to foster this culture is to prioritize psychological safety in the workplace. The power of
safety is very obvious, but often overlooked. Industry thought leaders like Gene Kim have been
promoting the importance of feeling safe to fail. He addresses the issue of psychological insecurity
in his novel, “The Unicorn Project.” Main character Maxine has been shunted from a highly
functional team to Project Phoenix, where mistakes are punishable by firing. Gene writes “She’s
[Maxine] seen the corrosive effects that a culture of fear creates, where mistakes are routinely
punished and scapegoats fired. Punishing failure and ‘shooting the messenger’ only cause people
to hide their mistakes, and eventually, all desire to innovate is completely extinguished.”

Getting the most out of your teams and systems cannot be achieved if blame exists. Blamelessness
is at the core of SRE. To fully adopt this practice, you need to acknowledge that people are not a
source of failure. Each team member is simply doing their best with the knowledge at hand, making
the decisions they believe are right and in the best interests of the organization. Punishment or
blame takes away the desire to try, fix, and continuously learn.

Fear is an innovation killer, but failure is an innovation inspiration. Creating safety and trust within
your organization is key to fully realizing and unleashing your team’s potential.

Begin your SRE journey today
Any organization can adopt SRE best practices, and it can begin
in small increments. The most important change you will make
will be the cultural one. As organizations are made of people,
any organization can foster continuous learning, blameless
culture, and psychological safety so long as its people are
committed to a growth mindset. Once these cultural factors are
in place, it becomes much easier to implement the practices,
processes, and tools that scale that culture of excellence.

And if you need a guiding hand along the way, remember
Blameless is here for you.

