Automation
Everything should be completely automated.
• If an existing process cannot be automated, it
will be replaced.
• If a proposed process cannot be automated, it
will be rejected.
• The SRE’s job is to automate themselves out of
a job. In practice this means constantly
automating menial tasks and moving on to
solve more interesting problems.
Ephemerality
Servers are ephemeral. They can and will go
away at any time.
• Servers live in auto-scaling groups that
self-heal.
• Servers have health checks that assert the
health of their process(es).
• Servers boot from images that are fully
equipped and operational.
• Configuration management does not run
against existing servers. It is used only to create
images.
• Application servers are stateless.
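
As a rough illustration of the health-check bullet above, the sketch below aggregates several per-process checks into a single HTTP-style status that a load balancer or auto-scaling group could poll. The function name `evaluate_health` and the shape of the `checks` mapping are hypothetical, chosen only for this example.

```python
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every check and return (HTTP-style status, per-check report)."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that raises is treated as unhealthy, not as a crash.
            report[name] = f"error: {exc}"
    # 200 only when every check passed; 503 tells the caller to replace the server.
    status = 200 if all(value == "ok" for value in report.values()) else 503
    return status, report
```

A server that returns 503 here would simply be terminated and replaced by its group, which is what makes treating servers as ephemeral practical.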
Engineers are ephemeral. They can and will
go away at any time.
• Engineering workloads are shared. There are no
individual silos.
• Engineering practices are documented.
Documentation is up to date.
• All engineers have access to all codebases.
Continuous Integration
All code changes are made via pull requests,
verified, and approved.
• All code is functionally tested, unit tested, and
linted.
• Linters are extremely opinionated. Engineers
should feel empowered to propose changes to
the rules in isolated discussions and pull
requests.
• Unit tests and linters run on every pull request,
preventing merges when the build fails.
• Functional tests run on every deploy,
preventing (or rolling back) deploys when the
build fails.
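
The last bullet can be sketched as a small deploy wrapper, assuming the release step and the functional test run are both available as callables. `deploy_with_rollback`, `release`, and `functional_tests_pass` are hypothetical names for this illustration, not part of any particular deploy tool.

```python
from typing import Callable


def deploy_with_rollback(
    new_commit: str,
    current_commit: str,
    release: Callable[[str], None],
    functional_tests_pass: Callable[[], bool],
) -> str:
    """Release new_commit, run functional tests, roll back on failure.

    Returns the commit that is live once the function finishes.
    """
    release(new_commit)
    if functional_tests_pass():
        return new_commit
    # The rollback goes through the same release path as a normal deploy.
    release(current_commit)
    return current_commit
```

Because the rollback reuses `release`, it is held to the same standards as a forward deploy and needs no human in the loop.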
Continuous Deployment
Deploys are easy, fast, safe, and frequent.
• Changes are deployed on every merge.
• Deploys do not require any human interaction
or approval.
• Deploy time matters and engineers should
strive to make it faster.
• Deploys can be started manually with a single
button. As many engineers as possible should
have access to the button.
• Rollbacks happen automatically when a failed
deploy is automatically detected.
• Rollbacks are held to all the same standards as
deploys.
• The master branch is the only branch that gets
deployed. All git branching is for the benefit of
the engineer prior to merging the changes into
master.
• It is easy to tell which commit is deployed.
• There is no such thing as a code freeze.
• Features are released by feature flags. Flipping
a flag does not require a deploy. A “flip freeze”
is acceptable.
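
One way to picture the feature-flag bullet is a flag store that application code consults at runtime, so flipping a flag changes behavior without shipping a new build. The `FlagStore` class below is a hypothetical in-memory stand-in for whatever flag service is actually used, and the flag name `new_checkout` is invented for the example.

```python
from typing import Dict


class FlagStore:
    """In-memory stand-in for a flag service that is read at runtime."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}
        self._frozen = False

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # Unknown flags fall back to the default, so new code ships dark.
        return self._flags.get(name, default)

    def flip(self, name: str, enabled: bool) -> None:
        if self._frozen:
            raise RuntimeError("flip freeze in effect")
        self._flags[name] = enabled

    def set_freeze(self, frozen: bool) -> None:
        # A "flip freeze" blocks flag changes while deploys continue as usual.
        self._frozen = frozen


flags = FlagStore()
flags.flip("new_checkout", True)
```

Code is still deployed on every merge during a flip freeze; only the release of features is paused.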
Software Engineering
SREs operate as software engineers, not
system administrators.
• Everything is managed in code. Any change to a
system is a code change.
• Code is written to be read by other engineers. It
is self-documenting.
• All processes are automated with software.
• CI/CD principles apply to all SRE code.
• The entire engineering team has access to all
SRE code.
Monitoring
All systems are monitored for critical
metrics.
• Metrics are easily available and consumable in a
single interface.
• Critical metrics are displayed on dashboards for
each system.
• The system that does the monitoring is
monitored by a separate system.
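
Monitoring the monitor is often done with a heartbeat: the monitoring system reports in at regular intervals, and a separate system raises an alarm when the reports stop. The sketch below shows only the staleness test; the function name and the 120-second default are assumptions made for the example.

```python
def monitor_is_stale(
    last_heartbeat: float,
    now: float,
    max_silence_seconds: float = 120.0,
) -> bool:
    """Return True when the monitoring system has been silent for too long.

    Both timestamps are seconds since the epoch, e.g. from time.time().
    """
    return now - last_heartbeat > max_silence_seconds
```

The check is kept trivial on purpose: the watcher should have far fewer ways to fail than the system it watches.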
Alerting
When self-healing fails, engineers are
intelligently notified.
• Alerts summarize the problem succinctly and
include suggested actions.
• Engineers are only paged off-hours for
production. Other environments may alert
engineers during business hours.
• After resolving the alert as quickly as possible,
the next step (during business hours) is to
ensure the same alert never fires again.
• Excessive alerting is unacceptable. It is
addressed immediately.
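
The routing rules above might be sketched as follows. The `Alert` fields, the three outcomes ("page", "notify", "hold"), and the Monday-to-Friday 09:00-17:00 definition of business hours are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    summary: str           # succinct description of the problem
    suggested_action: str  # what the responder should try first
    environment: str       # e.g. "production" or "staging"


def is_business_hours(now: datetime) -> bool:
    # Assumption: Monday-Friday, 09:00-17:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 17


def route(alert: Alert, now: datetime) -> str:
    """Decide how an alert reaches an engineer."""
    if alert.environment == "production":
        return "page"  # production pages at any hour
    # Other environments never page off-hours.
    return "notify" if is_business_hours(now) else "hold"
```

Making `summary` and `suggested_action` required fields means an alert cannot be constructed without the two things a paged engineer needs first.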
Incident Response
On-call engineers (both SREs and SEs) feel
empowered to respond in a timely manner.
• SEs are on-call for the systems they create and
own.
• SREs are on-call for low level systems and to
assist developers.
• All escalation policies have backups or
fallbacks.
• All escalation policies have rotations. No
engineer is on-call for a system full time.
• Escalating is acceptable if needed. Escalation
generates a follow-up task to understand why
the on-call engineer could not solve the
problem.
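
A minimal sketch of a rotating escalation policy with a backup, assuming a weekly rotation in which the next engineer in line serves as the fallback. The function name and the weekly cadence are hypothetical; real paging tools model this with their own schedule objects.

```python
from typing import List


def escalation_chain(rotation: List[str], week: int) -> List[str]:
    """Return [primary, backup] for the given week number."""
    if len(rotation) < 2:
        # With a single engineer there is no backup and no real rotation.
        raise ValueError("a rotation needs at least two engineers")
    primary = rotation[week % len(rotation)]
    backup = rotation[(week + 1) % len(rotation)]
    return [primary, backup]
```

Rejecting a one-person rotation encodes two of the rules above at once: every policy has a fallback, and nobody is on-call full time.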
Postmortems
All user-facing incidents require a
postmortem.
• Postmortems are blameless.
• The process for a postmortem is easy to conduct
and has very little overhead. A few sentences are
sometimes sufficient. A meeting is not always
required.
• Postmortems are conducted reasonably soon
after the incident is resolved.
• A repository of postmortems is easily accessible.
Security
Security is automated and baked into
everything.
• Security checks run as part of CI/CD.
• Intrusion detection systems are in place.
• Identity and access management is used to gate
all actions.
• As few infrastructure components as possible
are publicly accessible, ideally zero.
• Client applications only use public APIs.
• Engineers are trusted but verified.
• Credentials are not stored in plain text,
especially not in code.
• Credentials can be easily rotated.
• Access is revoked in a single place, which
propagates to all systems.
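
For the credential bullets, one common pattern is to read secrets from the process environment at call time instead of writing them into code. The sketch assumes some external mechanism (a secrets manager, the deploy tooling) populates the environment; the helper `get_credential` and the variable name used in the usage note are hypothetical.

```python
import os
from typing import Mapping


def get_credential(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Fetch a credential from the environment, failing loudly if it is absent."""
    value = env.get(name)
    if not value:
        # Never fall back to a hard-coded default for a secret.
        raise RuntimeError(f"credential {name!r} is not set")
    return value
```

A call such as `get_credential("DB_PASSWORD")` keeps the secret out of the repository, and rotating it means changing the value at its source rather than editing code.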
Offload security to managed services.
• Servers receive requests through managed load
balancers.
• All data stores receive requests from inside the
network only.
• Static content is delivered through a CDN.
Buckets are private.
Finance
SREs are financially conscious in all aspects
of their work.
• Cost measurements include engineering time
and effort.
• Tooling is used to monitor all engineering costs
an SRE can affect.
Cloud Architecture
An externally managed cloud is the default
place to run services. Running services by
any other means requires justification.
• Multi-region is appropriate once the trade-off
between downtime and cost has been properly measured.
• Multi-cloud (for redundancy) is almost never
worth the effort and loss of features.
On-premises solutions are appropriate when:
• A modern cloud front-end is in place
(OpenStack, etc.).
• IT, capacity planning, and system
administration are all top-notch.
• The increased overhead is drastically cost-effective
when engineering time is considered,
and is projected to remain this way for the
foreseeable future.
• SREs are not expected to physically interact
with the data center.
Containerized orchestration is appropriate
when:
• Services are shown to successfully run in
containers.
• Services are in a healthy state and sufficiently
modularized.
• The increased overhead is deemed acceptable.
• The company is willing to invest heavily in
tooling.
Serverless solutions are appropriate when:
• Tooling and automation are used to manage
serverless functions.
• Service owners are willing to accept the
limitations of serverless.
Supporting Services
The default option for supporting services
(logging, monitoring, alerting, etc.) is
externally managed and hosted. Running
these services internally requires
justification.
• SREs are constantly evaluating supporting
service options, new and old. The ability to
consolidate is a factor.
• Supporting services are secure, cost effective,
and useful to engineers.
People
SREs and SEs are on the same team. They
are all engineers.
• SREs are not blockers and allow access to as
many systems as possible.
• SEs own their services and do not “throw code
over the wall.”
• SREs are willing and able to contribute to and
debug application code.
• SREs use and contribute to open source, if
possible.
• SEs and SREs work together to plan new
services and architectures.
• SREs strive to make the lives of all engineers
better through automation.