Unit 05 - SRE

Site reliability engineering (SRE) uses software engineering practices to automate IT operations tasks like production system management, change management, and incident response that were traditionally done manually. The key principles of SRE include embracing and managing risk, defining service level objectives, eliminating manual toil through automation, continuous monitoring, release engineering, and simplicity. Common SRE practices involve using error budgets, defining service level objectives from a user perspective, monitoring for errors and availability, efficiently planning capacity, implementing change management processes, conducting blameless postmortems, and managing toil through automation. There are different models for implementing SRE teams including focusing on a single product, infrastructure, tools, embedding within development teams, or operating as consultants. SRE is

SITE RELIABILITY ENGINEERING (SRE)
UNIT - 05
Topics
➢Introduction to SRE

➢Principles of SRE

➢SRE Implementation

➢SRE Practices

➢SRE Vs DevOps

❑ Similarities
❑ Differences
What is SRE?
• Site reliability engineering (SRE) uses software engineering to automate IT
operations tasks - e.g. production system management, change
management, incident response, even emergency response - that would
otherwise be performed manually by systems administrators (sysadmins).
• The principle behind SRE is that using software code to automate oversight
of large software systems is a more scalable and sustainable strategy than
manual intervention - especially as those systems extend or migrate to the
cloud.
Principles of SRE
• Embracing and Managing Risk
An SRE's responsibility is to lean into failure and risk in order to learn how to ultimately
make services and systems more resilient.
• Service Level Objectives
The principle of embracing risk is closely tied to service level objectives, or SLOs. To go a bit
deeper, SLOs are the formalized set of objectives within a service level agreement (SLA) that are
measured against service level indicators, or SLIs.
• Eliminate Toil
Toil, as it is defined within the scope of the SRE role, is the amount of manual work that is
required to ensure services are running.
• Monitoring
Monitoring is one of the most important SRE principles within the role. Continuous monitoring
ensures that services are performing as intended and can help identify the moment issues arise
so they can be fixed immediately.
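The relationship between an SLI and an SLO described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SLO` class, the function names, and the request counts are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical service level objective: a target value for one SLI."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return good_requests / total_requests


def meets_slo(sli: float, slo: SLO) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo.target


# Made-up numbers for illustration only.
slo = SLO(name="availability", target=0.999)
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI={sli:.4f}, meets SLO: {meets_slo(sli, slo)}")
```

A continuous monitoring job could evaluate a check like this on every measurement window and alert when `meets_slo` returns `False`.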
Principles of SRE (Contd.)
• Automation
The nature of the SRE role is as diverse as a role can be. In order to reduce the potential for
manual intervention across all facets of their responsibilities, automating tasks is key to a
successful business.
• Release Engineering
Release engineering sounds like a complex subject, but in reality it is just a way to
define how software is built and delivered. While release engineering is a title and role
in its own right, within SRE it means delivering services that are stable,
consistent, and, of course, repeatable.
• Simplicity
For a position with a seemingly endless number of responsibilities and expectations,
the last principle of the SRE role is, ironically, simplicity.
SRE Practices
• Error Budgets
In a nutshell, an error budget is the amount of error that your service can accumulate
over a certain period of time before your users start being unhappy. You can think of it as
the pain tolerance for your users but applied to a particular dimension of your service:
availability, latency, and so forth.
• Define SLOs Like a User
Measure availability and performance in terms that matter to an end-user. You can’t have
error budgets, prioritize development work, or do timely and effective incident
management without them. SLOs should specify how they’re measured and the
conditions under which they’re valid.
• Monitoring Errors and Availability
To identify performance errors and maintain service availability, SRE teams need to see
what's going on in their systems. Monitoring is required to verify that an application or system is
behaving as expected. This means confirming that a service is meeting its specific goals and understanding
what happens when a change is made.
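As a worked example of the error budget idea above, the following Python sketch converts an availability SLO into allowed downtime and tracks how much of the budget is left. The function names, the 30-day window, and the downtime figure are assumptions chosen for illustration, not values from the slides.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% availability SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(slo_target=0.999, window_days=30)
left = budget_remaining(slo_target=0.999, window_days=30, downtime_minutes=10)
print(f"budget: {budget:.1f} min, remaining: {left:.1%}")
```

One common way to use a figure like this is to slow down or pause risky releases once the remaining fraction approaches zero.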
SRE Practices (Contd.)
• Efficiently Planning Capacity
Organizations need to plan for both organic growth, which comes from increased product
adoption, and inorganic growth, which comes from sudden jumps in demand due to feature
launches, marketing campaigns, etc.
• Paying Attention to Change Management
At many organizations, most outages are caused by changes to a live system, whether that's
a new binary push or a new configuration push. Every change, however small, impacts the
business. Therefore, each change should be analyzed for the risk it carries and supervised as it rolls out.
• Blameless Postmortem
A truly blameless postmortem culture helps to build a more reliable system in organizations.
Postmortems should be blameless and focus on process and technology, not people.
• Toil Management
One of the main focuses of SRE is automation. Toil is a waste of precious engineering time;
when SREs create frameworks, processes, and internal tooling to eliminate it,
engineers can get back to innovating.
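The capacity planning practice above distinguishes organic from inorganic growth. A rough Python sketch of how the two might be combined into a capacity estimate is shown below. The model is an assumption made for illustration (organic growth as a compounding monthly rate, inorganic growth as a one-off multiplier for a launch), as are the function names, the headroom value, and all of the numbers.

```python
import math


def projected_peak(current_peak_rps: float, monthly_growth: float,
                   months: int, launch_multiplier: float = 1.0) -> float:
    """Project peak load: compounding organic growth times a launch spike."""
    organic = current_peak_rps * (1.0 + monthly_growth) ** months
    return organic * launch_multiplier


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve the peak with spare headroom on top."""
    return math.ceil(peak_rps * (1.0 + headroom) / rps_per_instance)


# Hypothetical inputs: 1000 rps today, 5% organic growth per month,
# and a launch in six months expected to multiply demand by 1.5.
peak = projected_peak(1000, monthly_growth=0.05, months=6, launch_multiplier=1.5)
print(f"projected peak: {peak:.0f} rps")
print(f"instances needed: {instances_needed(peak, rps_per_instance=200)}")
```

A real forecast would be driven by measured traffic and load-test results rather than fixed constants; the point here is only that both kinds of growth feed into the same estimate.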
SRE Implementation and Processes
There are six models of SRE implementation; which one fits depends on the scenario.
• Model 1: Kitchen Sink
In this model, a single SRE team must cover all processes in the organization. It is the most
widely used approach, and it allows the team to grow organically along with the business.
• Best used: In smaller companies with one or two products and one or two
customer journeys. In this case, the need for SRE is present, but the scope is not enough to
justify more than a single dedicated SRE team.
• Model 2: Product/App
Such SRE teams dedicate their effort to improving the reliability of a single mission-critical
product or application at a time.
• Best used: By large companies that cannot cover the needs of all their products/services
with a single SRE team.
SRE Implementation and Processes
• Model 3: Infrastructure
Like DevOps teams, infrastructure SRE teams are centered on improving the
work quality and performance of the rest of the business. By automating repetitive
actions and removing structural and procedural bottlenecks, such teams speed up software
delivery.
• Best used: In larger companies with several separate development teams, which need
common standards to make processes uniform across the board. The DevOps team
handles CI/CD, test automation, and product releases, while the SRE team
ensures reliability.
• Model 4: Tools
Such SRE teams mostly concentrate on creating tools and features that help their fellow
developers be most productive. However, tool-centered SRE teams lack direct contact with
customer-facing reliability issues and might begin solving irrelevant problems.
• Best used: By any company in need of software tools not readily available through DevOps
or SaaS platforms.
SRE Implementation and Processes
• Model 5: Embedded
When SRE specialists are embedded within development teams, they usually perform hands-
on work like changing environment configurations to ensure maximum performance at every
step of the SDLC journey.
• Best used: When starting an SRE journey, to encourage adoption and speed up
transformation. However, this is a time-limited approach that must later be replaced with
other models.
• Model 6: Consulting
While being quite similar to the Embedded model, the Consulting SRE approach tends to
avoid actively changing the existing code and infrastructure configuration. Instead, such
specialists build tools that complement the existing processes.
• Best used: Before beginning your SRE implementation, to get a grip on SRE best practices.
Alternatively, when your company is too large to cover all its operational needs using only
in-house SRE capacity.
SRE & DevOps (Similarities)
• Both try to bridge the gap between the development team and the operations team
• Both share ownership of the service with the developers
• Both believe in implementing gradual changes and follow change management
approaches like CI/CD
• Automation is an integral part of the job for both
• Measurement is key to how both DevOps and SRE work
• Both accept failures as normal and practice blameless postmortems
SRE Vs DevOps
SRE Vs DevOps (Contd.)
THANK YOU
• References
• Why SRE? Principles and Practices for Your
Project | EPAM Anywhere Business
• Google - Site Reliability Engineering
(sre.google)
• What is Site Reliability Engineering (SRE) and
How to Build a Reliable Product
(relevant.software)
