
 

 
 
The Art of SLOs
Contents

Outage Math
How SLOs help…
The SLI Equation
Specifying SLIs
Developing SLOs and SLIs
Measuring SLIs
Stoker Labs Inc.
Service Architecture
User Journeys
Postmortem: Blank Profile Pages
Profile Page Errors and Latency
Resources
   


Outage Math 
Allowed 100% outage duration

Reliability Level   per year      per quarter   per 28 days
90%                 36d 12h       9d            2d 19h 12m
95%                 18d 6h        4d 12h        1d 9h 36m
99%                 3d 15h 36m    21h 36m       6h 43m 12s
99.5%               1d 19h 48m    10h 48m       3h 21m 36s
99.9%               8h 45m 36s    2h 9m 36s     40m 19s
99.95%              4h 22m 48s    1h 4m 48s     20m 10s
99.99%              52m 33.6s     12m 57.6s     4m 1.9s
99.999%             5m 15.4s      1m 17.8s      24.2s

Boxes shaded red allow less than one hour of complete outage.

Allowed consistent error% outage duration
(per 28 days, at 99.95% reliability)

Error rate          100%       10%          1%          0.1%
Allowed duration    20m 10s    3h 21m 36s   1d 9h 36m   14d
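
The arithmetic behind both tables reduces to a single formula: allowed outage time = window × error budget ÷ error rate. A minimal Python sketch, with a function name and signature of our own invention:

from datetime import timedelta

def allowed_outage(reliability: float, window: timedelta,
                   error_rate: float = 1.0) -> timedelta:
    """How long a constant error_rate (1.0 = complete outage) can be
    sustained within `window` before the error budget is spent."""
    error_budget = 1.0 - reliability           # e.g. 0.0005 for 99.95%
    return window * (error_budget / error_rate)

four_weeks = timedelta(days=28)
print(allowed_outage(0.9995, four_weeks))           # ~20m 10s
print(allowed_outage(0.9995, four_weeks, 0.01))     # 1 day, 9:36:00
print(allowed_outage(0.9999, timedelta(days=365)))  # ~52m 33.6s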


How SLOs help… 
…your business engineer for reliability

The product perspective:
If reliability is a feature, when do you prioritize it versus other features?

The development perspective:
How do you balance the risk to reliability from changing a system with the requirement to build new, cool features for that system?

The operations perspective:
What is the right level of reliability for the system you support?


The SLI Equation 
 
 

 
 

SLI = good events / valid events × 100%

The proportion of valid events that were good.
Expressing all SLIs in this form has some useful properties.

1. SLIs fall between 0% and 100%.
0% means nothing works, 100% means nothing is broken. This scale is intuitive and directly translates to percentage-reliability SLOs and error budgets.

2. SLIs have a consistent format.
Consistency allows common tooling to be built around SLIs. Alerting logic, error budget calculations, and SLO analysis and reporting tools can all be written to expect the same inputs: good events, valid events, and an SLO threshold.

Events can be prevented from counting against an error budget either by including them in the numerator or by excluding them from the denominator. The former is achieved by classifying some events good, the latter by classifying some events invalid.

Typically, for systems serving requests over HTTP(S), validity is determined by request parameters such as hostname or request path, to scope the SLI to a particular set of serving tasks or response handlers. Typically, for data processing systems, validity is determined by input parameters, to scope the SLI to subsets of the data.
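
For illustration, here is a minimal sketch of the SLI equation and an error-budget calculation built from the same two inputs; the function names and sample counts are invented:

def sli(good_events: int, valid_events: int) -> float:
    """SLI = good / valid, as a percentage. With zero valid events,
    nothing is measurably broken, so report 100%."""
    if valid_events == 0:
        return 100.0
    return 100.0 * good_events / valid_events

def error_budget_remaining(good: int, valid: int, slo: float) -> float:
    """Fraction of the error budget left over the SLO window: bad
    events seen so far versus the bad events the SLO target allows."""
    allowed_bad = valid * (1.0 - slo)
    actual_bad = valid - good
    return 1.0 - actual_bad / allowed_bad if allowed_bad else 0.0

print(sli(999_532, 1_000_000))                            # 99.9532
print(error_budget_remaining(999_532, 1_000_000, 0.999))  # 0.532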


 

Translating a user journey to SLI specifications 


In general, people use your service to achieve some set of goals, so the SLIs for that service must measure the interactions they have with the service in pursuit of those goals. We're going to call a set of interactions to achieve a single goal a user journey, a term we've borrowed from the field of user experience research.

An SLI specification is a formal statement of your users' expectations about one particular dimension of reliability for your service, like latency or availability. The SLI menu gives you guidelines for what dimensions of reliability you likely want to measure for a given user journey.

Once you have SLIs specified for a system, the next step is to refine them into implementations by making decisions around measurement, validity, and how to classify events as good.


Specifying an Availability SLI 
The availability of a system serving interactive requests from users is a critical reliability measure. If your system is not responding to requests successfully, it's safe to assume it is not meeting your users' expectations of its reliability.

Request / Response 
The suggested specification for a request/response Availability SLI is: 
The proportion of valid requests served successfully.

Turning this specification into an implementation requires making two choices: which of the requests this system serves are valid for the SLI, and what makes a response successful?
The definition of success tends to vary widely depending on the role of the system and the choice of how to measure availability. One commonly used signifier of success or failure is the status code of an HTTP or RPC response. This requires careful, accurate use of status codes within your system so that each code maps distinctly to either success or failure.
When  considering  the  availability  of  an  entire  user  journey,  care must be 
taken  to  enumerate  and measure the ways that users can voluntarily exit 
the journey before completion. 
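
To make those two choices concrete, here is a sketch of one possible implementation over parsed request logs. The path-based validity rule and the status-code mapping are assumptions for this example, not prescriptions:

def availability_sli(requests: list[dict]) -> float:
    """Proportion of valid requests served successfully."""
    # Validity: scope the SLI to API traffic, excluding health checks.
    valid = [r for r in requests if r["path"].startswith("/api/")]
    # Success: 5xx is a failure; 4xx is the user's error, so it still
    # counts as served successfully under this particular mapping.
    good = [r for r in valid if r["status"] < 500]
    return 100.0 * len(good) / len(valid) if valid else 100.0

sample = [
    {"path": "/api/profile", "status": 200},
    {"path": "/api/profile", "status": 503},  # failure: spends budget
    {"path": "/api/profile", "status": 404},  # valid and "successful"
    {"path": "/healthz",     "status": 200},  # invalid: excluded
]
print(f"{availability_sli(sample):.1f}%")  # 66.7%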

Other Availability SLIs 


Availability  is  a  useful  measurement  concept  for  a  wide  range  of 
scenarios  beyond  serving  requests.  The  availability  of  a  virtual  machine 
could  be  defined  as  the  proportion  of  minutes  that  it  was  booted  and 
accessible via SSH, for example.  
Sometimes,  complex  logic  is  required  to  determine  whether  a  system  is 
functioning  as  a  user  would  expect.  A  reasonable  strategy  here  is  to 
write  that  complex  logic  as  code  and  export  a  boolean  availability 
measure  to  your  SLO  monitoring  systems,  for  use  in  a  bad-minute  style 
SLI like the example above.   
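
A minimal sketch of that strategy, assuming hypothetical per-minute observations of whether a VM is booted and reachable via SSH:

def vm_available(obs: dict) -> bool:
    """The 'complex logic as code': trivially two checks here, but this
    could run arbitrary probes before exporting a boolean verdict."""
    return obs["booted"] and obs["ssh_reachable"]

def good_minutes_sli(minute_obs: list[dict]) -> float:
    """Proportion of sampled minutes in which the VM was available."""
    good = sum(vm_available(o) for o in minute_obs)
    return 100.0 * good / len(minute_obs)

day = ([{"booted": True, "ssh_reachable": True}] * 1438
       + [{"booted": True, "ssh_reachable": False}] * 2)  # 2 bad minutes
print(f"{good_minutes_sli(day):.2f}%")  # 99.86% of 1440 minutes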


Specifying a Latency SLI 
The  latency  of  a  system  serving  interactive  requests  from  users  is  an 
important  reliability  measure.  A  system  is  not  perceived  as  "interactive" 
by its users if their requests are not responded to in a timely fashion.  

Request / Response 
The suggested specification for a request/response Latency SLI is: 
The proportion of valid requests served faster than a threshold.

Turning this specification into an implementation requires making two choices: which of the requests this system serves are valid for the SLI, and when does the timer measuring the request latency start and stop?
Setting  a  threshold  for  “fast  enough”  depends  on  how  accurately 
measured  latency  translates  to  the  user  experience,  and  is  more  closely 
related  to  the  SLO  target.  Systems  can  be  engineered  to  prioritize  the 
perception  of  speed,  allowing  relatively  loose  thresholds  to  be  set. 
Requests may be made in the background by applications, and thus have 
no user waiting for the response.  
It can be useful to have multiple thresholds with different SLOs. When a single threshold is used, it often targets long-tail latency. However, it can also be useful to set broad-based latency expectations with a secondary 75-90% target, because the translation of perceived latency to unhappiness usually follows an S-curve rather than being binary.
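
As an illustration, here is a sketch of two latency thresholds with separate targets; the thresholds, targets, and samples are all invented:

THRESHOLDS_MS = (200, 1000)  # broad-based and long-tail thresholds
TARGETS = (0.90, 0.99)       # 90% of requests < 200ms, 99% < 1000ms

def latency_slis(latencies_ms: list[float]) -> list[float]:
    """Proportion of requests served faster than each threshold."""
    n = len(latencies_ms)
    return [sum(l < t for l in latencies_ms) / n for t in THRESHOLDS_MS]

samples = [50, 120, 180, 250, 400, 900, 1500, 90, 60, 110]
for threshold, target, actual in zip(THRESHOLDS_MS, TARGETS,
                                     latency_slis(samples)):
    verdict = "meets" if actual >= target else "misses"
    print(f"{actual:.0%} < {threshold}ms {verdict} the {target:.0%} target")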

Other Latency SLIs 


Latency  can  be  equally  important  to  track  for  data  processing  or 
asynchronous  work-queue  tasks.  If you have a batch processing pipeline 
that  runs  daily,  that  pipeline  probably  shouldn't  take  more  than  a  day  to 
complete.  Users  care  more  about  the  time  it  takes  to  complete  a  task 
they queued than the latency of the queue acknowledgement.  
SLI  metrics  must  be  updated  as  soon  as  the threshold is crossed, rather 
than  at  the  eventual  success  or  failure of a long-running operation. If the 
threshold  is  30  minutes  but  processing  latency  is  only  reported  when  a 
failure  occurs  after  2  hours,  there  is  a  90-minute  window  where  that 
operation was not ​measurably​ missing expectations.   


Specifying a Quality SLI 
If  your  system  has  mechanisms  to  trade  off  the  quality  of  the  response 
returned to the user for, e.g., lower CPU or memory utilization, you should 
track  this  graceful  degradation  of  service  with  a  quality  SLI.  Users  may 
not  be  consciously  aware  of  the  degradation  in  quality  until  it  becomes 
severe,  but  their  subconscious  perceptions  may  still  have  an  impact  on 
your  business  if,  e.g.,  degrading  quality  means  serving  less  relevant  ads 
to users, reducing click-through rates. 

Request / Response 
The suggested specification for a request/response Quality SLI is: 
The proportion of valid requests served without degrading quality.

Turning this specification into an implementation requires making two choices: which of the requests this system serves are valid for the SLI, and how to determine whether the response was served with degraded quality.
In  most  cases,  the  mechanism  used  by  the  system  to degrade response 
quality  should  also  be  able  to  mark responses as degraded or increment 
metrics  to  count  them.  It  is  therefore  much  easier  to  express  this  SLI  in 
terms of "bad events" rather than "good events".  
Similar  to  measuring  latency,  if  the  quality  degradation  falls  along  a 
spectrum,  it  can  be  useful  to  set  SLO  targets  at  more  than  one  point 
from  that  spectrum.  For  a somewhat contrived example of this, consider 
a  service  that  fans  out  incoming  requests  to 10 optional backends, each 
with  a  99.9%  availability  target  and  the  ability  to  reject  requests  when 
they  are  overloaded.  You  might  choose  to  specify  that  99%  of  service 
responses  must  be  served  with  no  missing  backend  responses  and 
99.9% must be served with no more than one missing response.   
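
A sketch of that contrived example, with an invented distribution of missing backend responses:

from collections import Counter

# missing[n] = number of requests for which n of the 10 optional
# backends failed to respond; the counts are invented.
missing = Counter({0: 9_900, 1: 90, 2: 10})
total = sum(missing.values())

no_missing = missing[0] / total                  # target: 99%
at_most_one = (missing[0] + missing[1]) / total  # target: 99.9%
print(f"{no_missing:.2%} served with no missing responses")  # 99.00%
print(f"{at_most_one:.2%} served with at most one missing")  # 99.90%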


Specifying a Freshness SLI 
When  batch-processing  data,  it  is  common  for  the  utility  or  relevance  of 
the  outputs  to  degrade  over  time  as  new  input  data  is  generated  by  the 
system  or  its  users.  The  users,  in  turn,  have  expectations  that  the 
outputs  of  the  system  are  up-to-date  with  respect  to  those  inputs.  Data 
processing  pipelines  must  be  run  regularly  or  perhaps  even  rebuilt  to 
process  small  increments  of  input  data  continuously  to  meet  those 
expectations.  A  freshness  SLI  measures  the  system's  performance 
against those expectations and can inform those engineering decisions. 

Data Processing 
The suggested specification for a data processing Freshness SLI is: 
The proportion of valid data updated more recently than a threshold.

Turning this specification into an implementation requires making two choices: which of the data this system processes are valid for the SLI, and when does the timer measuring the freshness of data start and stop?
For  a  batch-processing  system,  freshness  can  be  approximated  as  the 
time  since  the  completion  of  the  last  successful  processing  run.  More 
accurate  freshness  measurements  for  batch  systems  usually  require 
augmenting  processing  systems  to  track  generation  and/or  source  age 
timestamps.  Freshness  for  incremental  streaming  processing  systems 
can  also  be  measured  with  a  watermark  that  tracks  the  age  of  the most 
recent record that has been fully processed. 
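
A minimal sketch of a freshness SLI built on generation timestamps; the four-hour threshold and the sample reads are invented:

from datetime import datetime, timedelta

FRESHNESS_THRESHOLD = timedelta(hours=4)

def freshness_sli(reads: list[tuple[datetime, datetime]]) -> float:
    """reads: (time of read, generation timestamp of data served).
    Good events are reads of data generated within the threshold."""
    good = sum(read - generated <= FRESHNESS_THRESHOLD
               for read, generated in reads)
    return 100.0 * good / len(reads)

reads = [
    (datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 9, 30)),  # 2.5h old
    (datetime(2024, 1, 1, 12, 1), datetime(2024, 1, 1, 7, 0)),   # 5h old
]
print(freshness_sli(reads))  # 50.0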

Measuring Data Freshness as Response Quality 


Stale  serving  data  is  a  common way for response quality to be degraded 
without  a system making an active choice to do so. Measuring stale data 
as  degraded  response  quality  is  a  useful  strategy:  if  no  user  accesses 
the  stale  data,  no  expectations  around  the  freshness  of that data can be 
missed. For this to be feasible, the parts of the system responsible for generating the serving data must also produce a generation timestamp that the serving infrastructure uses to check against a freshness threshold when it reads data.

Specifying a Coverage SLI 
A  coverage  SLI  functions  similarly  to  an  availability SLI when processing 
data  in  a  system.  When  users  have  expectations  that  data  will  be 
processed  and  the  outputs  made  available  to  them, you should consider 
using a coverage SLI. 

Data Processing 
The suggested specification for a data processing Coverage SLI is: 
The proportion of valid data processed successfully.

Turning this specification into an implementation requires making two choices: which of the data this system processes are valid for the SLI, and how to determine whether the processing of a particular piece of data was successful.
For the most part, the system doing the processing of the data ought to be able to determine whether a record that it began processing was processed successfully, and output counts of successes and failures. The challenge comes from identifying records that should have been processed but were skipped for some reason. This usually requires a way of determining the number of valid records from outside the data processing system itself, perhaps by running the equivalent of COUNT(*) on the data source.
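
A minimal sketch, assuming the expected record count comes from such an external query:

def coverage_sli(succeeded: int, failed: int, expected: int) -> float:
    """Proportion of valid records processed successfully. `expected`
    comes from outside the pipeline, e.g. COUNT(*) on the source."""
    skipped = expected - (succeeded + failed)  # records never attempted
    assert skipped >= 0, "processed more records than the source holds?"
    return 100.0 * succeeded / expected if expected else 100.0

# 9,950 succeeded and 30 failed inside the pipeline, but the source
# says 10,000 were eligible: 20 records were skipped without a trace.
print(coverage_sli(succeeded=9_950, failed=30, expected=10_000))  # 99.5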

Specifying a Correctness SLI 
In  some  cases,  it  can be important to measure not just that a processing 
system  processes  all  the  data  it  should  have,  but  that  it  produces  the 
correct  outputs  while  doing  so.  Correctness  is  something  best  ensured 
proactively  via  good  software  engineering  and  testing  practice,  rather 
than detected reactively in absentia. However, when users have strong
expectations  that  the  data  they  are  accessing  has  been  generated 
correctly—and  have  ways  of  independently  validating  that  correctness— 
having  an  SLI  to  measure  correctness  on  an  ongoing  basis  can  be 
valuable. 

Data Processing 
The suggested specification for a data processing Correctness SLI is: 
The proportion of valid data producing correct output.

Turning this specification into an implementation requires making two choices: which of the data this system processes are valid for the SLI, and how to determine the correctness of output records.
For a correctness SLI to be useful, the method of determining correctness needs to be independent of the methods used to generate the output data. Otherwise, it is probable that any correctness bugs that exist during generation will also exist during validation, preventing the detection of the resulting incorrectness by the SLI. A common strategy is to have "golden" input data that produces known-good outputs when processed. If this input data is sufficiently representative of real user data, and is designed to exercise most of the processing system's code paths, then this can be sufficient to estimate overall correctness.
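
A minimal sketch of the golden-data strategy; the pipeline, its task, and the golden cases are invented:

GOLDEN_CASES = [
    ({"km": 5.0}, {"miles": 3.107}),
    ({"km": 0.0}, {"miles": 0.0}),
]

def correctness_sli(process) -> float:
    """Proportion of golden inputs producing the known-good output."""
    good = sum(process(inp) == want for inp, want in GOLDEN_CASES)
    return 100.0 * good / len(GOLDEN_CASES)

def pipeline(record: dict) -> dict:
    # The system under measurement; validation must stay independent
    # of this code, or shared bugs will go undetected.
    return {"miles": round(record["km"] * 0.62137, 3)}

print(correctness_sli(pipeline))  # 100.0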

Specifying a Throughput SLI 
Data  processing  systems  may  be  designed  to  operate  continuously  on 
streams  or  small  batches  of  data  to  lower  the  processing  latency 
observed  by  users.  A  latency  SLI  is  usually  the  best  way  to quantify this, 
but  the  overall  throughput  of  the  system  may  be  a  better  measure when 
you have promised to provide your users with a given level of throughput, 
or  if  their  expectations  of  processing  latency  are  not constant, like when 
the quantity of data processed per "event" varies dramatically. 

Data Processing 
The suggested specification for a data processing Throughput SLI is: 
The proportion of time where the data processing rate is faster than a threshold.
The  structure  of  this  specification  is  somewhat different to a latency SLI 
because  throughput  is  a  rate  of  events  over  time.  We  fit  this into the SLI 
equation  by  defining  an  event  as  the  passage  of  a  unit  of  time,  like  a 
second  or  a  minute,  and  a  "good"  event  to  be  a  unit  of  time  where  the 
rate  of  processing  was  fast  enough.  Turning  this  specification  into  an 
implementation  requires  making  a  single  choice:  the  units  of 
measurement for your data processing rate. 
A  common  measure  of  throughput  is  "bytes  per  second",  since  we  can 
measure  the  quantity  of  data  to  be  processed  in  bytes,  and  the  amount 
of  work  it  takes  to  process  data  is  often  proportional  to  its  size.  Any 
metric  that  scales  roughly  linearly  with  respect to the cost of processing 
should  work.  The  system  processing  the  data  ought  to be able to record 
metrics that quantify the rate at which it is processing data. 
As  with  latency  and  quality,  throughput  rates  are  a  spectrum  where 
multiple  thresholds  may  be  appropriate.  You  may  be  able  to  tolerate 
operating  with  a  reduced  overall  throughput  for  hours  or  even  days  if 
your  system  is  designed  with  QoS  levels  to  ensure  important  data  still 
makes it through quickly while low-priority stuff queues up. 
Throughput  SLIs  can  be  more useful than latency SLIs for request-driven 
services  with  highly  variable  latency.  Often,  the  variance  is  driven  by 
requests  having  a  variable  processing  cost,  like  the  difference  between 
uploading a thumbnail and a 4K hi-resolution image.  
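
A minimal sketch of a time-based throughput SLI; the threshold and the traffic pattern are invented:

MIN_RATE = 600 * 1024**2  # bytes per minute, i.e. 10 MiB/s sustained

def throughput_sli(bytes_per_minute: list[int]) -> float:
    """Each minute is an event; a minute is good when the amount of
    data processed during it met the threshold rate."""
    good = sum(b >= MIN_RATE for b in bytes_per_minute)
    return 100.0 * good / len(bytes_per_minute)

hour = [700 * 1024**2] * 55 + [200 * 1024**2] * 5  # five slow minutes
print(f"{throughput_sli(hour):.1f}%")  # 91.7% of minutes were fast enough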

Developing SLOs and SLIs 
For each critical user journey, stack-ranked by business impact:

1. Choose an SLI specification from the menu
2. Refine the specification into a detailed SLI implementation
3. Walk through the user journey and look for coverage gaps
4. Set SLOs based on past performance or business needs

Example SLO Worksheet

Make sure that your SLIs have an event, a success criterion, and specify where and how you record success or failure. Describe your specification as the proportion of events that were good. Make sure that your SLO specifies both a target and a measurement window.

User Journey: Home Page Load

SLI Type: Latency

SLI Specification:
Proportion of home page requests that were served in < 100ms
(Above, "[home page requests] served in < 100ms" is the numerator in the SLI Equation, and "home page requests" is the denominator.)

SLI Implementations:
● Proportion of home page requests served in < 100ms, as measured from the 'latency' column of the server log.
(Pros/Cons: This measurement will miss requests that fail to reach the backend.)
● Proportion of home page requests served in < 100ms, as measured by probers that execute JavaScript in a browser running in a virtual machine.
(Pros/Cons: This will catch errors when requests cannot reach our network, but may miss issues affecting only a subset of users.)

SLO:
99% of home page requests in the past 28 days served in < 100ms.
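
A minimal sketch of the first SLI implementation above, assuming hypothetical parsed log rows with 'path' and 'latency_ms' fields:

def home_page_latency_sli(log_rows) -> float:
    """Proportion of home page requests served in < 100ms."""
    valid = good = 0
    for row in log_rows:
        if row["path"] != "/":       # only home page requests are valid
            continue
        valid += 1
        if row["latency_ms"] < 100:  # the 'latency' column, in ms
            good += 1
    return 100.0 * good / valid if valid else 100.0

rows = [
    {"path": "/",        "latency_ms": 42},
    {"path": "/",        "latency_ms": 180},  # too slow: valid, not good
    {"path": "/profile", "latency_ms": 95},   # not a home page request
]
print(home_page_latency_sli(rows))  # 50.0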
 

Measuring SLIs 
Broadly speaking, there are five ways to measure an SLI, each with its own set of advantages and disadvantages. Like many engineering decisions, there is no one right choice for all situations, but with a good understanding of the trade-offs involved, it is possible to choose SLI implementations that meet the requirements of the system.
These  classes  of  measurement  methods  are  presented  in  decreasing 
order  of  their  distance  from  the  user.  In  general,  an  SLI  should  measure 
the  user  experience  as  closely  as  possible,  so  proximity  to  the  user  and 
their interactions with the system is a valuable property. 

Logs Processing 
Processing server-side logs of requests or data to generate SLI metrics. 

Pros:
+ Existing request logs can be processed retroactively to backfill SLI metrics.
+ Complex user journeys can be reconstructed using session identifiers.
+ Complex logic to derive an SLI implementation can be turned into code and exported as two, much simpler, "good events" and "total events" counters.

Cons:
– Application logs do not contain requests that did not reach servers.
– Processing latency makes logs-based SLIs unsuitable for triggering an operational response.
– Engineering effort is needed to generate SLIs from logs; session reconstruction can be time-consuming.

Application Server Metrics 


Exporting  SLI  metrics  from  the  code  that  is  serving  requests  from  users 
or processing their data. 

Pros:
+ Often fast and cheap (in terms of engineering time) to add new metrics.
+ Complex logic to derive an SLI implementation can be turned into code and exported as two, much simpler, "good events" and "total events" counters.

Cons:
– Application servers are unable to see requests that do not reach them.
– Measuring overall performance of multi-request user journeys is difficult if application servers are stateless.

Front-end Infrastructure Metrics 
Utilizing  metrics  from  load-balancing  infrastructure  (e.g.  GCP's  layer  7 
load balancer) to measure SLIs. 

Pros:
+ Metrics and recent historical data most likely already exist, so this option probably requires the least engineering effort to get started.
+ Measures SLIs at the point closest to the user still within serving infrastructure.

Cons:
– Not viable for data processing SLIs or, in fact, any SLIs with complex requirements.
– Only measures approximate performance of multi-request user journeys.

Synthetic Clients (Probers) or Data 


Building  a  client  that  sends  fabricated  requests  at  regular  intervals  and 
validates  the  responses.  For  data  processing  pipelines,  creating 
synthetic, known-good, input data and validating outputs. 

Pros:
+ Synthetic clients can measure all steps of a multi-request user journey.
+ Sending requests from outside your infrastructure captures more of the overall request path in the SLI.

Cons:
– Approximates user experience with synthetic requests.
– Covering all corner cases is hard and can devolve into integration testing.
– High reliability targets require frequent probing for accurate measurement.
– Probe traffic can drown out real traffic.

Client Instrumentation 
Adding observability features to the client the user is interacting with and 
logging events back to your serving infrastructure that track SLIs. 

Pros:
+ Provides the most accurate measure of user experience.
+ Can quantify reliability of third parties, e.g., CDN or payments providers.

Cons:
– Client log ingestion and processing latency make these SLIs unsuitable for triggering an operational response.
– SLI measurements contain a number of highly variable factors potentially outside of direct control.
– Building instrumentation into the client can involve lots of engineering work.

Stoker Labs Inc. 
This is a work of fiction. Names, businesses, events, and game mechanics 
are either the products of the author’s imagination or used in a fictitious 
manner. Any resemblance to actual games, living or dead, is purely 
coincidental. 

Mission Statement 
Our  company's  mission  is  to  ​"replace  conflict  with  games"​.  The  division 
we  see  throughout  society  becomes  harmful  when  people  take  life  too 
seriously.  We  aim  to  provide  an  outlet  for  the  competitive  urges  so 
central  to  human  nature  via  the  medium  of  mobile  video  gaming.  We 
firmly  believe  that  providing  ways  for  people  to  sublimate these urges in 
a  manner  that  is  fun  rather  than  psychologically  and  physically  harmful 
will bring about a more cooperative and successful world. 

Our Game: Fang Faction 


The rise of the vampires has taken a devastating toll on humanity, forcing those who survived to cluster together in the few remaining habitable regions, far from previous centers of civilization. As the leader of a faction of survivors, you must recruit people to your cause, secure and upgrade your settlement, raid vampire-occupied cities, and battle other factions for control of resources.
The  game  world  is  split  up  into  a  number  of  areas  with  varying  rewards 
and  challenges.  Access  to  areas  with  better  rewards  is  gated  by  overall 
playtime,  settlement  size  and  in-game  currency  expenditure.  Each  area 
has its own leader board ranking the top factions. 
We  have  around  50  million  monthly  active  users  playing,  with  between 
one  and  ten  million  players  online  at  any  given  time.  We  add  new  world 
areas once per month, which drives a spike in both traffic and revenues. 
The primary revenue stream stems from the exchange of real-world money for in-game currency via purchases in the app. Players can also earn currency by winning battles against other players, playing mini-games, or over time, via control of in-game resource production. Players can spend in-game currency on settlement upgrades, defensive emplacements for battles, and on a recruitment mini-game that gives them a chance of recruiting highly-skilled people to their faction.

Service Architecture 

 
The game has both a mobile client and a web UI. The mobile client makes requests to our serving infrastructure via JSON RPC messages transmitted over RESTful HTTP. It also maintains a web socket connection to receive game-state updates. Browsers talk to the web servers via HTTPS. Leader boards are updated every 5 minutes.

User Journeys 
View Profile Page 
Players  can  log  into  their  game  account,  view  their settlement and make 
profile  changes  from  a  web  browser.  A  player  loading  their  profile  page 
is a simple journey that we will go through together in the workshop.  

Buy In-Game Currency 
Our  most  important  user  journey  is  the  one  that  generates  all  our 
revenue:  users  buying  in-game  currency  via  in-app  purchases.  Requests 
to  the  Play Store are only visible from the client. We see between 0.1 and 
1  completed  purchase  every  second;  this  spikes  to  10  purchases  per 
second  after  the  release  of  a  new  area  as  players  try  to  meet  its 
requirements. 

 
   

App Launch 
There  are  three  parts  to  the  app  launch  process,  depending  on  whether 
the  user  already  has  an  account  and  whether  that  account  has  been 
previously  accessed  on  the  current  device.  Account  creation  and  auth 
token  request  rates  are  low,  but  we  see  between  20  and  100  QPS  of 
syncData requests, spiking to 1000 after the release of a new area. 

Manage Settlement 
Settlement  management  is  a  trio  of  relatively  simple  API  requests. 
Upgrading  settlements,  building  defenses  and  recruiting  faction 
members  consumes  wall-clock  time  as  well  as  in-game  currency. 
Players  spend  a  lot  of  time  managing  their  settlements:  we  see 
2000-3000  requests  per  second  across  all  these  API  endpoints,  spiking 
up  to  10,000/s.  The  game  servers  "tick"  to  update  state  for  all  players in 
an  area  once per second. If they consistently take more than a second to 
compute  the  new  game  state,  users  will  notice  their  buildings  not 
completing on time. 

 
   

Battle Another Player 
Launching an attack on another player spins up a real-time "tower defence" battle where the attacker's troops try to overrun the defenders' emplacements. The defender can deploy some of their troops to aid their defence. Both sides get points in proportion to the amount of damage they dealt to the opposition. We see around 100 attacks launched every second.

Generate Leader Boards 
Competition  for  the  top  spots  is  fierce  because  players  in  a  given  area 
primarily  battle  each  other  and  are  of  similar  skill  levels.  Battles  are 
scheduled across the pool of game servers on a least-loaded basis. 
Battle  scores  are  written  to  a  PubSub  feed  at  the  end  of  a  battle  by  the 
game  server  that  hosted  that  battle.  This  feed  is  read  by  a  processor 
running  alongside  the  leader  board’s  data  store  which  has  a  number  of 
responsibilities.  It  updates  the  score  tables  for  each  game  area  and 
maintains  global  tables  of  the  top  attacking  and  defending  scores  for 
individual games, over the last rolling hour and previous full day. 
Every  five  minutes,  the  top  50  snapshots  of  all  tables  are  generated  by 
querying  the  leader  board  store  and  writing  the  results  to  an  in-memory 
key-value  store.  Previous  snapshots  are  kept  for  30  minutes  and  then 
garbage  collected.  Finally,  all  game  completions  and  their  scores  are 
recorded  to  an  append-only  file  for  each  hour,  which  is  archived  to  a 
cloud storage service. 
 

 
   

Postmortem: Blank Profile Pages 
Impact 
From  08:43  to  13:17  CEST,  users  accessing  their  profile  pages  received 
incomplete  responses.  This  rendered  them  unable  to  view  or  edit  their 
profile. 
Root Causes and Trigger 
The proximate root cause was a bug in the web server's handling of unicode HTML templates. The trigger was commit a6d78d13, which changed the profile page template to support localization, but at the same time accidentally introduced unicode quotation marks (U+201C “, U+201D ”) into the template HTML. When the web server encountered these instead of the standard ASCII quotation mark (U+0022 "), the template engine aborted rendering of the output.
Detection 
Because the aborted rendering process did not throw an exception, the HTTP status code for the incomplete responses was still 200 OK. The problem thus went undetected by our SLO-based alerts. The support and social media teams manually escalated concerns about a substantially increased level of complaints relating to the profile page at 12:14 CEST.
Lessons Learned 
Things that went well:
● Support and social media teams were able to find the correct escalation path and successfully contact the ops team.

Things that went poorly:
● HTTP status code SLIs could not detect incomplete responses.
● Web server used a severely outdated, vendored version of the templating engine with substantially broken unicode support.

Where we got lucky:
● User profile page is relatively unimportant to our revenue stream.
Action Items 
… to be determined.   

Profile Page Errors and Latency 
NOTE: both these graphs have logarithmic Y axes.
We use logarithmic axes here to make it easier to perceive details in the graphs. It prevents the large spikes in errors or latency from swamping the smaller, more consistent background variation.

[Graphs: profile page error rate and latency over time]

 
 

Resources 
You can find links to all the resources associated with this workshop at https://cre.page.link/art-of-slos. All the content is released to the public under the Creative Commons CC-BY-4.0 license.
If you've found any errors in the content, please file a bug:
https://cre.page.link/art-of-slos-bug 
If  you  have suggestions for improvement or additional content, we'd love 
to hear from you: 
https://cre.page.link/art-of-slos-improve 
Or if you just want to ask a question or get other help with the workshop: 
https://cre.page.link/art-of-slos-help 
If  you'd  like  to  know  more  about  this  subject,  you  can  find  our  Coursera 
course on Measuring and Managing Reliability here: 
https://cre.page.link/coursera 
Thanks for participating! 

