Unit 05 - SRE

Site reliability engineering (SRE) uses software engineering practices to automate IT operations tasks, such as production system management, change management, and incident response, that were traditionally done manually. The key principles of SRE include embracing and managing risk, defining service level objectives, eliminating manual toil through automation, continuous monitoring, release engineering, and simplicity. Common SRE practices include using error budgets, defining service level objectives from a user's perspective, monitoring for errors and availability, planning capacity efficiently, implementing change management processes, conducting blameless postmortems, and managing toil through automation. There are different models for implementing SRE teams, including focusing on a single product, on infrastructure, or on tools, embedding SREs within development teams, or operating as consultants.

Uploaded by

firozakhatoon Sheik


SITE RELIABILITY ENGINEERING (SRE)
UNIT - 05
Topics
➢Introduction to SRE

➢Principles of SRE

➢SRE Implementation

➢SRE Practices

➢SRE Vs DevOps

❑ Similarities
❑ Differences
What is SRE?
• Site reliability engineering (SRE) uses software engineering to automate IT
operations tasks - e.g. production system management, change
management, incident response, even emergency response - that would
otherwise be performed manually by systems administrators (sysadmins).
• The principle behind SRE is that using software code to automate oversight
of large software systems is a more scalable and sustainable strategy than
manual intervention - especially as those systems extend or migrate to the
cloud.
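As a minimal sketch of what replacing a manual sysadmin task with code can look like, the Python function below checks a service's health and restarts it automatically, escalating to a human only after a fixed number of failed attempts. The function name `auto_remediate`, the `check_health` and `restart` callables, and the `max_restarts=3` limit are hypothetical illustrations, not part of any specific SRE tool; in practice the callables would wrap a real health endpoint and service manager.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RemediationResult:
    healthy: bool
    restarts: int
    log: List[str] = field(default_factory=list)


def auto_remediate(check_health: Callable[[], bool],
                   restart: Callable[[], None],
                   max_restarts: int = 3) -> RemediationResult:
    """Restart an unhealthy service automatically; escalate after max_restarts."""
    result = RemediationResult(healthy=False, restarts=0)
    while True:
        if check_health():
            result.healthy = True
            result.log.append("healthy")
            return result
        if result.restarts >= max_restarts:
            # Automation has done what it safely can: record an escalation for on-call.
            result.log.append("escalate: page on-call engineer")
            return result
        restart()
        result.restarts += 1
        result.log.append(f"restart #{result.restarts}")


if __name__ == "__main__":
    # Simulated service that recovers after two restarts (stub for illustration).
    state = {"restarts": 0}

    def fake_check() -> bool:
        return state["restarts"] >= 2

    def fake_restart() -> None:
        state["restarts"] += 1

    print(auto_remediate(fake_check, fake_restart).log)
    # ['restart #1', 'restart #2', 'healthy']
```

Passing the probe and restart actions in as functions keeps the loop testable with stubs, and the explicit escalation step reflects the idea that automation handles routine cases while people handle the rest.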
Principles of SRE
• Embracing and Managing Risk
An SRE's responsibility is to lean into failure and risk in order to learn how to make their services and systems more resilient.
• Service Level Objectives
The principle of embracing risk is closely tied to service level objectives, or SLOs. To go a bit deeper, SLOs are the formalized objectives, typically underpinning a service level agreement (SLA), that are measured against service level indicators, or SLIs.
• Eliminate Toil
Toil, as defined within the scope of the SRE role, is the manual, repetitive work required to keep services running.
• Monitoring
Monitoring is one of the most important SRE principles. Continuous monitoring ensures that services are performing as intended and helps identify the moment issues arise, so they can be fixed quickly.
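The relationship between SLIs and SLOs can be illustrated with a short Python sketch, assuming a request-based availability SLI (successful requests divided by total requests). The `SLO` class, the function names, and the 99.9% target are illustrative assumptions; real SLIs are usually computed by a monitoring system from collected metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float  # e.g. 0.999 means at least 99.9% of requests must be good


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: SLO) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo.target


if __name__ == "__main__":
    checkout = SLO(name="checkout availability", target=0.999)
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"{checkout.name}: SLI={sli:.4%}, met={meets_slo(sli, checkout)}")
```

Keeping the SLI (what is measured) separate from the SLO (the target it is compared against) mirrors the distinction drawn above and lets the same SLI feed several objectives or alerts.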
Principles of SRE (Cont'd)
• Automation
The SRE role is about as diverse as a role can be. To reduce the need for manual intervention across all of the role's responsibilities, SREs automate tasks wherever possible; automation is key to a successful business.
• Release Engineering
Release engineering sounds like a complex subject, but in reality it is simply a way to define how software is built and delivered. Although release engineering is a title and role in its own right, within SRE it means delivering services that are stable, consistent, and, of course, repeatable.
• Simplicity
For a position like SRE, with a seemingly endless list of responsibilities and expectations, the last principle is, ironically, simplicity: simple systems are easier to understand, monitor, and change safely, which makes them more reliable.
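One ingredient of repeatable releases is being able to tell whether two builds came from exactly the same inputs. The sketch below assumes the sources are available as a mapping from file path to bytes and that the build configuration can be expressed as a string, and derives a deterministic SHA-256 fingerprint from those inputs; `build_fingerprint` is a hypothetical helper, not a standard build-tool API.

```python
import hashlib
from typing import Mapping


def build_fingerprint(sources: Mapping[str, bytes], build_config: str) -> str:
    """Return a deterministic ID for a build: identical inputs give an identical ID."""
    digest = hashlib.sha256()
    # Sort paths so the result does not depend on the order files were collected in.
    for path in sorted(sources):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator; assumes paths contain no NUL bytes
        digest.update(hashlib.sha256(sources[path]).digest())  # fixed-length content hash
    digest.update(build_config.encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    files = {"app/main.py": b"print('hello')\n", "app/util.py": b"X = 1\n"}
    print(build_fingerprint(files, "python=3.12 release"))
```

If a release pipeline records this fingerprint, rebuilding an unchanged commit with the same configuration should reproduce it, while any change to a file or to the configuration produces a different value.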
SRE Practices
• Error Budgets
In a nutshell, an error budget is the amount of error that your service can accumulate
over a certain period of time before your users start being unhappy. You can think of it as
the pain tolerance for your users but applied to a particular dimension of your service:
availability, latency, and so forth.
• Define SLOs Like a User
Measure availability and performance in terms that matter to an end user. Without SLOs, you can't have error budgets, prioritize development work, or manage incidents in a timely and effective way. SLOs should specify how they're measured and the conditions under which they're valid.
• Monitoring Errors and Availability
To identify performance errors and maintain service availability, SRE teams need to see what's going on in their systems. Monitoring is required to verify that an application or system is behaving as expected. This means confirming that a service is meeting specific goals and understanding what happens when a change is made.
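The error budget follows directly from the SLO: if the target is 99.9%, the budget is the remaining 0.1% of requests (or of time) in the measurement window. The Python sketch below shows that arithmetic for both a request-based and a time-based view; the function names, the 30-day window, and the rule of pausing feature releases once the budget is spent are illustrative assumptions rather than a prescribed policy.

```python
from datetime import timedelta


def time_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time in a window: (1 - SLO target) * window length."""
    return window * (1 - slo_target)


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo_target) * total
    if allowed_bad == 0:
        return 1.0 if good == total else float("-inf")
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def releases_allowed(remaining: float) -> bool:
    """Illustrative policy: keep shipping features only while budget remains."""
    return remaining > 0


if __name__ == "__main__":
    print(time_budget(0.999, timedelta(days=30)))  # 0:43:12 (43.2 minutes)
    remaining = budget_remaining(0.999, good=999_600, total=1_000_000)
    print(f"{remaining:.0%} of the budget left; releases allowed: {releases_allowed(remaining)}")
```

Expressing the budget as a remaining fraction makes it easy to compare services with different SLO targets and traffic volumes.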
SRE Practices (Cont'd)
• Efficiently Planning Capacity
Organizations need to plan for both organic growth, such as increased product adoption, and inorganic growth, which comes from sudden jumps in demand due to feature launches, marketing campaigns, and similar events.
• Paying Attention to Change Management
At many organizations, most outages are caused by changes to a live system, whether that's a new binary push or a new configuration push. Even a small change can impact the business, so analyze each change for the risk it carries and supervise its rollout.
• Blameless Postmortem
A truly blameless postmortem culture helps to build a more reliable system in organizations.
Postmortems should be blameless and focus on process and technology, not people.
• Toil Management
One of the main focuses of SRE is automation. Toil wastes precious engineering time; when SREs create frameworks, processes, and internal tooling to eliminate it, engineers can get back to innovating.
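The distinction between organic and inorganic growth maps naturally onto a simple capacity forecast: organic growth compounds over time, while a launch or campaign adds a step increase. The Python sketch below assumes a constant monthly growth rate, a known per-instance throughput, a 30% headroom margin, and one spare instance for redundancy; all of these parameters and function names are hypothetical, and real capacity planning would rely on load testing and observed demand.

```python
import math


def projected_peak_rps(current_peak_rps: float, monthly_growth: float, months: int,
                       launch_spike_rps: float = 0.0) -> float:
    """Forecast peak requests per second after `months`.

    Organic growth compounds monthly; inorganic growth (e.g. a feature launch
    or marketing campaign) is modeled as a one-off additive spike.
    """
    organic = current_peak_rps * (1 + monthly_growth) ** months
    return organic + launch_spike_rps


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.3, spare_instances: int = 1) -> int:
    """Instances required to serve the peak with safety headroom, plus spares."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance) + spare_instances


if __name__ == "__main__":
    peak = projected_peak_rps(1_000, monthly_growth=0.05, months=6, launch_spike_rps=500)
    print(f"forecast peak: {peak:.0f} rps -> {instances_needed(peak, rps_per_instance=200)} instances")
    # forecast peak: 1840 rps -> 13 instances
```

Separating the demand forecast from the provisioning rule lets each assumption (growth rate, launch size, headroom) be revisited independently as real data arrives.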
SRE Implementation and Processes
There are six models of SRE implementation, each suited to a different scenario.
• Model 1: Kitchen Sink
In this model, a single SRE team must cover all processes in the organization. It is the most
widely used approach, and it allows the team to grow organically along with the business.
• Best used: In smaller companies with one or two products and one or two customer journeys. In this case, SRE needs are present, but the scope is not large enough to justify more than a single dedicated SRE team.
• Model 2: Product/App
Such SRE teams dedicate their effort to improving the reliability of a single mission-critical
product or application at a time.
• Best used: By large companies that cannot cover the needs of all their products/services
with a single SRE team.
SRE Implementation and Processes (Cont'd)
• Model 3: Infrastructure
Just like the DevOps teams, the infrastructure SRE teams are centered around improving the
job quality and performance of the rest of your business. Through automating repetitive
actions and removing structural and procedural bottlenecks, such teams speed up software
delivery.
• Best used: In larger companies with several separate development teams, which need common standards to unify processes across the board. The DevOps team handles CI/CD, test automation, and product releases, while the SRE team ensures reliability.
• Model 4: Tools
Such SRE teams mostly concentrate on creating tools and features that help their fellow developers be more productive. However, tool-centered SRE teams can lack direct contact with customer-facing reliability issues and might begin solving irrelevant problems.
• Best used: By any company in need of software tools not readily available through DevOps
or SaaS platforms.
SRE Implementation and Processes (Cont'd)
• Model 5: Embedded
When SRE specialists are embedded within development teams, they usually perform hands-on work, such as changing environment configurations, to ensure maximum performance at every step of the SDLC.
• Best used: When starting an SRE journey, to drive adoption and speed up transformation. However, this is a limited-time approach that should later be replaced with other models.
• Model 6: Consulting
Although quite similar to the Embedded model, the Consulting SRE approach tends to avoid actively changing existing code and infrastructure configuration. Instead, these specialists build tools that complement existing processes.
• Best used: Before beginning your SRE implementation, to get a grip on SRE best practices, or when your company is too large to meet all its operational needs with in-house SRE capacity alone.
SRE & DevOps (Similarities)
• Both try to bridge the gap between the development team and the operations team
• Both share ownership of services with developers
• Both believe in implementing gradual changes and follow change management
approaches like CI/CD
• Automation is an integral part of the job for both
• Measurement is absolutely key to how both DevOps and SRE work.
• Both accept failures as normal and practice blameless postmortems
SRE Vs DevOps
SRE Vs DevOps (Cont'd)
THANK YOU
