Site reliability engineering (SRE)
Site reliability engineering (SRE) is a software engineering approach to IT
operations. SRE teams use software as a tool to manage systems, solve problems,
and automate operations tasks.
SRE takes the tasks that have historically been done by operations teams, often
manually, and instead gives them to engineers or operations teams who use
software and automation to solve problems and manage production systems.
SRE is a valuable practice when creating scalable and highly reliable software
systems. It helps manage large systems through code, which is more scalable and
sustainable for system administrators (sysadmins) managing thousands or
hundreds of thousands of machines.
SRE can also reduce or remove much of the natural friction between development
and operations teams. Development teams want to continually release new or
updated software into production. Operations teams, however, don't want to
release any type of update or new software without being sure it won't cause
outages or other operations problems. As a result, while not strictly required
for DevOps, SRE aligns closely with DevOps principles and can play an important
role in DevOps success.
What is site reliability engineering?
Site reliability engineering (SRE) uses software engineering to automate IT
operations tasks - for example, production system management, change
management, incident response, and even emergency response - that would
otherwise be performed manually by systems administrators (sysadmins).
The principle behind SRE is that using software code to automate oversight of large
software systems is a more scalable and sustainable strategy than manual
intervention - especially as those systems extend or migrate to the cloud.
SRE can also reduce or remove much of the natural friction between development
teams who want to continually release new or updated software into production,
and operations teams who don't want to release any type of update or new software
without being sure it won't cause outages or other operations problems. As a result,
while not strictly required for DevOps, SRE aligns closely with DevOps principles and
can play an important role in DevOps success.
The concept of SRE is credited to Ben Treynor Sloss, VP of engineering at Google,
who famously wrote that "SRE is what happens when you ask a software engineer to
design an operations team."
What does a site reliability engineer do?
Site reliability engineer is a unique role. It is typically filled by someone with a
background as a sysadmin, a software developer with additional operations
experience, or someone in an IT operations role who also has software development
skills.
SRE teams are responsible for how code is deployed, configured, and monitored, as
well as the availability, latency, change management, emergency response, and
capacity management of services in production.
SRE teams help decide when new features can launch. They use service-level
agreements (SLAs) to define the required reliability of the system, expressed through
service-level indicators (SLIs) and service-level objectives (SLOs).
An SLI measures a specific aspect of the service level provided. Key SLIs include
request latency, availability, error rate, and system throughput. An SLO is the
target value or range for a specified service level, as measured by an SLI.
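As a rough sketch of how SLIs and an SLO relate, the following Python computes availability, error rate, and a latency percentile from a small batch of request records and compares them with targets. The Request record, the helper names, the sample data, and the target values are illustrative assumptions, not figures or APIs from any particular service or monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def availability(requests):
    """SLI: fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.ok for r in requests) / len(requests)


def error_rate(requests):
    """SLI: fraction of requests that failed."""
    return 1.0 - availability(requests)


def latency_percentile(requests, pct):
    """SLI: nearest-rank latency percentile, with pct given as an integer 1..100."""
    if not requests:
        raise ValueError("need at least one request")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(requests, min_availability, max_p95_ms):
    """SLO check: both targets must hold for the measured SLIs."""
    return (availability(requests) >= min_availability
            and latency_percentile(requests, 95) <= max_p95_ms)


# Assumed sample: 97 fast successes and 3 slow failures.
sample = [Request(120.0, True)] * 97 + [Request(900.0, False)] * 3
```

With this sample, the measured availability is 97%, so an assumed SLO of 99% availability is missed while a looser 95% target is met. In practice, these SLIs would usually come from a monitoring system rather than an in-memory list.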
An SLO for the required system reliability is then based on the downtime
determined to be acceptable. This downtime level is referred to as an error budget—
the maximum allowable threshold for errors and outages.
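The arithmetic behind an error budget can be sketched as follows. This is a minimal illustration, assuming an availability SLO measured over a 30-day window. The function names and the simple launch rule (launch only while some budget remains) are assumptions for illustration; real teams define their own windows and policies.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Minutes of error budget left after the downtime observed so far."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_launch(slo, downtime_minutes, window_days=30):
    """Assumed policy: allow a launch only while budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0
```

For example, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime. Under the assumed policy, a service that has already been down for 10 minutes in the window could still launch, while one that has been down for 50 minutes could not.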
SRE and DevOps
DevOps is a modern way to deliver higher quality applications faster - by
automating the software delivery lifecycle, and by giving development and
operations teams more shared responsibility and more input into each other’s work.
Like SRE, DevOps makes a business more agile by balancing the need to deliver
more applications and changes faster with the need to avoid 'breaking' the
production environment. And like SRE, DevOps aims to achieve this balance by
establishing an acceptable risk of errors. In fact, SRE and DevOps seem so similar
that some experts say they're the same thing—but most see SRE practices as
excellent ways to implement DevOps principles.
SREs can code
In a traditional operations model, you throw people at a reliability problem and
keep pushing (sometimes for a year or more) until the problem either goes away or
blows up in your face.
Not so in SRE. The development and SRE teams share a single staffing pool, so
for every SRE who is hired, one fewer developer headcount is available (and vice
versa). This ends the never-ending headcount battle between Dev and Ops, and
creates a self-policing system where developers get rewarded with more teammates
for writing better-performing code (i.e., code that needs less support from fewer
SREs).
SRE benefits
Gain greater visibility into service health
Quantify the cost of downtime
Optimize incident response
Build a modern network operations center
Migration from traditional IT and on-premises data centers to hybrid
cloud environments is one of the chief reasons that the average enterprise
generates two to three times more operations data every year. Increasingly, SRE is
seen as being critical for leveraging this data to automate systems administration,
operations and incident response, and to improve enterprise reliability even as the
IT environment becomes more complex.