Major Incident Management

The document provides an overview of major incident management (MIM), detailing its importance in minimizing business impact during critical IT issues. It outlines the stages of a major incident, the MIM process, common mistakes, best practices, and key performance metrics. The document emphasizes the necessity of having a structured MIM process to effectively handle major incidents and reduce downtime and financial losses.

Uploaded by

lennss857

Major incident management: An overview

It's Monday morning and things are pretty normal at your service desk. Suddenly, you get an
alert ticket that a critical service is down, and within the next 15 minutes you start getting an
influx of tickets reporting the same issue. It could be that your website is down, your point of
sale software has stopped working, or something even more far-reaching, like the stock
exchange going down or planes being grounded. When your business is severely impacted
by an IT issue causing loss of revenue and/or reputation, you have a major incident on your
hands.

How you react to a major incident makes all the difference in minimizing the impact of the
incident and bringing services back up. As they say, time is money, and in this case, that
couldn't be more true. If your organization has a major incident management (MIM) process
in place, you can swiftly respond to and resolve major incidents. If you don't have such a
process in place, it's time to draw up an emergency response plan, also known as a
major incident response process.

The stakes of a major incident are higher than ever before. According to a study by
Information Technology Intelligence Consulting, 98 percent of organizations lose at least
$100,000 from a single hour of downtime. This reinforces the importance of setting up an MIM
process that can tackle major incidents effectively and efficiently.

Every organization aims to eliminate major incidents, but the bottom line is that major
incidents are impossible to prevent completely and the only thing you can do is be prepared
for them.

In this guide, we'll look at how to set up an effective MIM process, common mistakes that
can affect your organization's MIM, and best practices for improving your MIM process.

But first, what makes an incident a major incident?

What is a major incident?


A major incident is a high-impact, urgent issue that usually affects the whole organization or
a major part of it. A major incident almost always results in an organization's services
becoming unavailable, which causes the organization's business to take a hit and ultimately
affects its financial standing. There are two ways a major incident can affect an
organization's services:

• By preventing customers from accessing the organization's services. The Cloudflare
outage in July 2019 is an example of customers being affected by a major incident.
This major outage affected almost half the internet and left millions of internet users
unable to access various services.

• By disrupting employees' ability to complete their work on time, leading to a
business disruption. IndiGo's outage in November 2019 affected the airline's check-in
process, which led to long delays and affected thousands of passengers.

A well-prepared service desk is equipped to assess major incidents and come up with
solutions or workarounds to reduce and control the impact of a major incident.

The 4 stages of a major incident

Major incidents are considered to have 4 main stages, namely:

• Identification
• Containment
• Resolution
• Maintenance

The major incident management process

A major incident management process is a must-have for organizations, as it helps them
minimize the business impact of a major incident. The major incident management process
primarily consists of the following steps:

Stage 1: Identification

Declaring the major incident:

The first step is to identify possible major incidents. It is important for organizations to set up
multiple methods of identifying threats. Major incidents can be flagged by technicians when
they come across unusual tickets, or they can be detected by solutions like network
monitoring tools that can automatically flag a network issue and create a ticket to alert the
service desk. Organizations can also set up a dedicated hotline for service desk personnel to
flag suspected major incidents.
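
As a rough illustration of automated flagging, the sketch below counts tickets per service in a sliding time window and raises a flag when the count crosses a threshold. The `SurgeDetector` class, the 15-minute window, and the threshold of 5 tickets are all hypothetical choices, not part of any particular service desk tool, and the sketch assumes tickets arrive in chronological order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical values; tune them to your own ticket volumes.
WINDOW = timedelta(minutes=15)
THRESHOLD = 5


class SurgeDetector:
    """Flags a service when too many tickets arrive within a sliding window."""

    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._recent = defaultdict(deque)  # service -> timestamps of recent tickets

    def record(self, service, created_at):
        """Record a ticket; return True if the service looks like a suspected major incident."""
        timestamps = self._recent[service]
        timestamps.append(created_at)
        # Drop tickets that have fallen out of the sliding window.
        while timestamps and created_at - timestamps[0] > self.window:
            timestamps.popleft()
        return len(timestamps) >= self.threshold


detector = SurgeDetector()
start = datetime(2024, 1, 1, 9, 0)
# Five tickets about the same service, two minutes apart.
flags = [detector.record("website", start + timedelta(minutes=2 * i)) for i in range(5)]
print(flags)  # [False, False, False, False, True]
```

A flag raised this way is only a suspicion; a person, typically the major incident manager, still decides whether to declare a major incident.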

Informing stakeholders:

Once a major incident has been identified, it needs to be communicated to all key
stakeholders. There are four main groups that need to be informed of major incidents:

• Technical team: It is important to inform the technical team immediately so they can
start deciding on a course of action to fix the issue.

• Management: Keeping upper management, like the CIO, informed about major
incidents helps with accountability. Organizations should also keep management
informed of all the steps taken to fix major incidents.

• Key stakeholders: The department heads and service-level business management
staff also need to be informed of major incidents and receive regular status updates.

• Users: Users need to know which services may be unavailable due to a major
incident.
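
To make the four audiences concrete, here is a minimal sketch that builds one message per stakeholder group from a single incident record. The `MajorIncident` fields, the ticket ID format, and the wording of each message are illustrative assumptions; delivery (email, SMS, chat) is deliberately left out.

```python
from dataclasses import dataclass

# The four audiences named above; the message focus for each is an illustrative assumption.
AUDIENCE_FOCUS = {
    "technical team": "Start triage and decide on a course of action.",
    "management": "You will be kept informed of every step taken to fix the incident.",
    "key stakeholders": "Regular status updates will follow.",
    "users": "The affected service may be unavailable.",
}


@dataclass
class MajorIncident:
    ticket_id: str
    service: str
    summary: str


def build_notifications(incident):
    """Return one message per stakeholder group, keyed by group name."""
    prefix = f"[{incident.ticket_id}] Major incident on {incident.service}: {incident.summary}"
    return {group: f"{prefix} {focus}" for group, focus in AUDIENCE_FOCUS.items()}


incident = MajorIncident("INC-1001", "point-of-sale", "Checkout is failing.")
messages = build_notifications(incident)
for group, text in messages.items():
    print(f"{group}: {text}")
```

Keeping the audience list in one place means a new group can be added without touching the code that builds or sends the messages.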

Stage 2: Containment

Assembling the major incident team


A major incident team, or MIT for short, consists of technicians, service-level management
heads, and other key stakeholders; sometimes highly skilled external personnel are brought
in to tackle a major incident. The MIT works together to find a fix for the major incident and
bring operations back to normal.

Setting up a conference bridge

A conference bridge, more commonly known as a conference call, helps with effective
troubleshooting and centralized communication. It acts as a clear, fast channel of
communication between members of the MIT.

Preparing a designated war room

Having a designated war room allows all members of the MIT to gather and troubleshoot the
incident. This improves collaboration, helping the MIT come up with a solution faster.

Creating a problem ticket to identify underlying issues

A problem ticket can be created to discover and understand the root cause of the major
incident. This can help prevent similar major incidents in the future by addressing the causes
of the major incident.

Stage 3: Resolution

Implementing the resolution plan as a change

It is good practice to implement the fix for the major incident as a change to ensure that the
resolution is properly documented and implemented. Implementing the resolution as a
change minimizes the risk of a botched resolution disrupting other services.

Stage 4: Maintenance

Performing a post-implementation review

It is important to take stock of the incident over a period of time to make sure it's truly
resolved. If underlying issues are left unresolved, they could lead to another major incident.

Producing clear documentation

Documenting the entire process of resolving the major incident helps the organization
prepare for similar incidents in the future. With proper documentation of past incidents, the
organization can implement the tried and tested solution immediately when faced with
another similar major incident, reducing its impact.

Measuring metrics

Measuring the performance of the service desk helps gauge the effectiveness of the service
desk and the MIM process. Some important metrics to measure are mean time to
acknowledge (MTTA), mean time to resolve (MTTR), total number of major incidents, and
average downtime for major incidents.

Major IT incident management process flow chart

Major incident management roles and responsibilities


A major incident calls for a special group of personnel to tackle the incident and resolve it.
MIM roles include:

Service desk technicians

Service desk technicians are the first line of defense against major incidents. They analyze
incident tickets and escalate them to the incident manager. Service desk technicians are also
involved in the implementation of resolutions.

Major incident manager

The major incident manager is the owner of the major incident. Their role includes declaring
the incident as a major incident and ensuring that the MIM process is followed and the
incident is resolved as early as possible. They act as the main point of contact for any
information about the major incident, and manage the MIT.

MIT

An MIT is a specialized team that is responsible for analyzing the major incident and
formulating an action plan to handle the threat. The MIT ideally consists of service desk
technicians, service-level management personnel, technical staff, other relevant
stakeholders, and external consultants if the situation requires it.

Technical staff

An organization's technical staff is made up of the specialized personnel responsible for the
upkeep of infrastructure and operations, including sysadmins, network administrators, and
information security staff. The technical staff help troubleshoot the major
incident and are primarily responsible for implementing the major incident resolution.

Change manager

The change manager is the owner of the change that is created to implement the fix for the
major incident. The change manager takes full ownership of the change ticket and is
accountable for it.

Problem manager

If a problem is created in response to the major incident, the problem manager owns the
problem ticket. The problem manager tries to ascertain the root causes of the incident and
ensure it doesn't occur again, or that the organization is at least prepared for the next time
the incident occurs.

External consultants or third-party vendors

In some cases, the major incident may require highly specialized personnel to help
understand and troubleshoot the incident. The major incident manager identifies the
required personnel and adds them to the MIT to help reduce the impact of the major
incident.

RACI matrix

A RACI matrix defines the responsibilities of various stakeholders in a process. The table
below defines the roles and responsibilities of the major incident stakeholders throughout
the MIM process.

Stage | Process/roles | Service desk technicians | Major incident manager | MIT | Technical staff | Change manager | Problem manager | External consultants
Identification | Declaring the major incident | C | A | R | C | I | I | I
Identification | Informing stakeholders | C | A | R | I | I | I | I
Containment | Assembling the MIT | I | R/A | C | C | I | C | I
Containment | Setting up a conference bridge | I | A | R | C | I | C | I
Containment | Preparing a designated war room | I | A | R | I | I | C | I
Containment | Creating a problem ticket to identify underlying issues | I | A | R | C | I | I | I
Resolution | Implementing the resolution plan as a change | I | I | I | R | A | C | C
Maintenance | Performing post-implementation review | I | C | I | R | A | C | I
Maintenance | Producing clear documentation | C | A | R | C | C | C | C
Maintenance | Measuring metrics | I | A | R | I | I | I | C

* R - Responsible, A - Accountable, C - Consulted, I - Informed
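
A RACI matrix can also be kept as data so that it can be queried and sanity-checked. The sketch below encodes three rows of the table as an example; the helper names `roles_with` and `check_single_accountable` are hypothetical, and the check for exactly one accountable role per activity reflects a common RACI convention rather than anything stated in this guide.

```python
# Column order follows the RACI table above.
ROLES = [
    "Service desk technicians", "Major incident manager", "MIT",
    "Technical staff", "Change manager", "Problem manager", "External consultants",
]

# Three rows copied from the table; the remaining rows follow the same pattern.
RACI = {
    "Declaring the major incident": ["C", "A", "R", "C", "I", "I", "I"],
    "Assembling the MIT": ["I", "R/A", "C", "C", "I", "C", "I"],
    "Implementing the resolution plan as a change": ["I", "I", "I", "R", "A", "C", "C"],
}


def roles_with(activity, letter):
    """Return the roles whose assignment for an activity includes the given RACI letter."""
    return [role for role, code in zip(ROLES, RACI[activity]) if letter in code.split("/")]


def check_single_accountable(matrix):
    """Return the activities that do not have exactly one accountable role."""
    return [activity for activity in matrix if len(roles_with(activity, "A")) != 1]


print(roles_with("Assembling the MIT", "A"))  # ['Major incident manager']
print(check_single_accountable(RACI))         # []
```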

5 Common mistakes in major incident management


Here are 5 common mistakes that can hinder your MIM process:

1. Manual communication and escalation

By far the biggest challenge to MIM is communication. In the event of a major incident,
various stakeholders need to be informed of the status of the incident, its severity, and what
troubleshooting has been done to fix it. Communicating all this manually is an arduous task,
and can lead to inconsistent communication, which only makes matters worse. By
automating the process, key stakeholders are notified throughout the entire ticket life cycle,
and the major incident manager can focus their entire attention on fixing the issue.

2. Ineffective channels for reporting major incidents

Every service desk receives tens or even hundreds of tickets a day, ranging from laptop
issues to service requests; among this mountain of tickets, there could be a few potential
major incidents. Not setting up a separate channel to report major incidents delays the
identification of major incidents.

3. Duplication of efforts

Failure to delegate tasks in an organized manner can cause duplication of efforts within the
MIT. It is important to assign tasks and keep the MIT informed of what each member is
tasked with.

4. Poor documentation

Lack of proper documentation will force the MIT to reinvent the wheel every time a similar
major incident occurs, leading to delays in resolving major incidents and causing
unnecessary downtime.

5. Failure to analyze the root cause

Similar to incident management, MIM can be myopic in scope, as its primary focus is to fix
the issue and get services up and running within the shortest possible time. If not combined
with problem management to identify underlying issues, the underlying cause of a major
incident will continue to make the organization vulnerable to major incidents.

5 Major incident management best practices


Here are the best ways to approach the MIM process:

1. Enable multiple channels for reporting major incidents

When it comes to handling major incidents, time is of the essence. It is vital for organizations
to identify and classify major incidents as soon as they are detected. Offering users multiple
ways to report incidents will make the entire process faster and more accessible. You can
enable ticket creation through email or a web portal, or even set up a dedicated hotline to
report suspected major incidents. Setting up network monitoring software to detect
anomalies can help you proactively deal with major incidents.

2. Automate service desk processes

Speed and efficiency play a vital role in controlling the impact of a major incident, and
automating various service desk processes helps achieve this by freeing up your technicians
from repetitive tasks such as notifying stakeholders. Automating the notification system and
setting up major incident workflows are good ways of automating service desk processes to
improve resolution time and bring structure to your MIM process.

3. Strive for prompt, relevant communication

It is important to keep your organization's management and important stakeholders
informed of every major incident. Keeping management in the loop will help with getting
necessary approvals and permissions required to fix the major incident. Prompt
communication ensures that all the major incident personnel are on the same page and
allows for smooth, effective collaboration; it also keeps end users informed of any possible
downtime so they can prepare for it.

4. Create clear documentation

Clear documentation helps the major incident manager record all the work done to fix the
major incident, its impact, the affected services, and other key information about the major
incident. This documentation is important to show management the benefit of having a
MIM process, including its ROI. Clear documentation will also help with any similar major
incident in the future.

5. Utilize deep integrations with ITOM software

Strong integrations with ITOM software enable the IT department to proactively handle
major incidents. Reactive major incident identification relies on an influx of tickets to raise a
red flag that a major incident is in progress. On the other hand, a proactive MIM process
that utilizes ITOM integrations has systems in place to monitor networks and services, and
can automatically flag anomalies that could be potential major incidents.

Major incident management metrics and KPIs

When it comes to MIM, below are some important metrics and KPIs to track.

KPI | Formula
Mean time to resolution (MTTR) | The average time from when a major incident is reported to when it is resolved.
Mean time to acknowledge (MTTA) | The average time to respond to a major incident.
Mean time between failures (MTBF) | The average time between failures. It is calculated by dividing the total uptime by the total number of failures.
Mean time to detect (MTTD) | The average time taken to detect major incidents or anomalies.
Percent increase or decrease of major incidents | The percent increase of problems in subsequent months relative to the first month.
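
The formulas above translate directly into code. The following is a minimal sketch, assuming each incident is stored with reported, acknowledged, and resolved timestamps; the `IncidentRecord` type, the function names, and the sample figures are hypothetical and exist only to show the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    reported: datetime
    acknowledged: datetime
    resolved: datetime


def mtta(incidents):
    """Mean time to acknowledge: average of (acknowledged - reported), in minutes."""
    return mean((i.acknowledged - i.reported).total_seconds() / 60 for i in incidents)


def mttr(incidents):
    """Mean time to resolution: average of (resolved - reported), in minutes."""
    return mean((i.resolved - i.reported).total_seconds() / 60 for i in incidents)


def mtbf(total_uptime_hours, failures):
    """Mean time between failures: total uptime divided by the number of failures."""
    return total_uptime_hours / failures


def percent_change(first_month, later_month):
    """Percent increase (positive) or decrease (negative) relative to the first month."""
    return (later_month - first_month) / first_month * 100


t0 = datetime(2024, 3, 4, 9, 0)
records = [
    IncidentRecord(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=65)),
    IncidentRecord(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=115)),
]
print(mtta(records))         # 10.0
print(mttr(records))         # 90.0
print(mtbf(700, 2))          # 350.0
print(percent_change(4, 5))  # 25.0
```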

Major incident scenario


It is important to remember that not all high-priority incidents are major incidents. Since the
MIM process involves a sizable commitment of resources, like assembling a separate MIT,
it is important to carefully classify major incidents.
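
One way to make this classification explicit is a small rule based on impact and urgency, which the scenario below also uses. This is a sketch under stated assumptions: the three-level scale and the "high priority" and "standard" labels are hypothetical, and only the rule that a major incident combines high impact with high urgency comes from this guide.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify(impact, urgency):
    """Classify an incident; only high impact combined with high urgency counts as major."""
    if impact is Level.HIGH and urgency is Level.HIGH:
        return "major incident"
    if impact is Level.HIGH or urgency is Level.HIGH:
        return "high priority"
    return "standard"


print(classify(Level.HIGH, Level.HIGH))   # major incident
print(classify(Level.HIGH, Level.LOW))    # high priority
print(classify(Level.LOW, Level.MEDIUM))  # standard
```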

Source: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/

The 2019 Cloudflare outage is a very good example of what defines a major incident. In this
case, a standard operating procedure of updating a managed rule for the web application
firewall (WAF) spiked the usage of CPUs dedicated to serving HTTP/HTTPS traffic to nearly
100 percent across the servers in Cloudflare's network. The outage that followed resulted in
a reduction of 80 percent of Cloudflare's traffic, and affected millions of internet users
around the world.

Impact: Large

The outage resulted in Cloudflare customers (and their customers) seeing a 502 error page
when visiting any Cloudflare domain. The 502 errors were generated by the front-end
Cloudflare web servers that still had CPU cores available but were unable to reach the
processes that serve HTTP/HTTPS traffic. It's estimated that at least half of the entire
internet was inaccessible for the twenty-seven minutes of downtime.

Urgency: High

All Cloudflare websites were inaccessible, causing service disruptions for thousands of
organizations and millions of users. The outage affected the internal operations of
Cloudflare, too, preventing the Cloudflare employees from accessing various services like the
company's change management tool and internal control panel. The outage had to be dealt
with to resume normal service operations.

Timeline of events from detection to resolution:

The WAF managed rule was implemented at 13:42; three minutes later, Cloudflare's network
operation tools started flagging the drop in traffic, many other end-to-end tests of Cloudflare
services began failing, end users noticed various 502 errors, and Cloudflare received many
reports of CPU exhaustion from its points of presence in cities worldwide.

The site reliability engineering team, London engineering team, and other relevant teams
were brought together to troubleshoot and come up with a fix. At 14:00, the WAF was
identified as the cause of the incident. And at 14:07, a global WAF kill was implemented to
bring traffic levels back to normal.

By 14:52, Cloudflare was fully satisfied that it understood the cause of the outage
and had a fix in place, so the WAF was re-enabled globally.
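
As a worked example, the timestamps in this timeline can be turned into elapsed minutes from the moment the rule was implemented. The sketch below uses only the times quoted above (13:45 is derived from "three minutes later") and the date from the source URL; the event labels and the `minutes_since_start` helper are hypothetical.

```python
from datetime import datetime

DAY = "2019-07-02"  # date taken from the source URL cited above

# Times as quoted in the timeline; labels are shortened for illustration.
timeline = {
    "WAF managed rule implemented": "13:42",
    "Monitoring flags the traffic drop": "13:45",
    "WAF identified as the cause": "14:00",
    "Global WAF kill implemented": "14:07",
    "WAF re-enabled globally": "14:52",
}


def minutes_since_start(events):
    """Return the minutes elapsed between the earliest event and each event."""
    parsed = {
        name: datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M")
        for name, clock in events.items()
    }
    start = min(parsed.values())
    return {name: int((ts - start).total_seconds() // 60) for name, ts in parsed.items()}


elapsed = minutes_since_start(timeline)
for name, minutes in elapsed.items():
    print(f"+{minutes:>2} min  {name}")
```

Read this way, detection took 3 minutes, identifying the cause took 18 minutes, and the global WAF kill was in place 25 minutes after the change went out.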

Glossary

Change

The addition, modification, or removal of anything that can have a direct or indirect effect on
services.

Change management

The process of taking changes to completion with minimum disruptions and collisions.

Escalation

The act of transferring ownership of a ticket based on a functional or hierarchical need.

Event

An occurrence that has significance for the management of a service or asset.

Failure

An occurrence where a service or asset does not function according to the agreed SLA.

Hierarchical escalation

The act of transferring ownership vertically to a higher tier service desk technician or
relevant authority.

Impact

A measure of the severity of an incident.

Incident

An unplanned interruption to an IT service, or a reduction in the quality of an IT service.
Failure of a configuration item, even if it has not yet affected a service, is also an incident
(e.g. failure of one disk from a mirror set).

Incident management

The process of managing the life cycle of all incidents to restore normal service operations as
quickly as possible and minimize business impact.

Incident prioritization

Assigning priorities to incidents and defining what constitutes a major incident.

Major incident

An incident that has a high impact and high urgency, requiring a separate process from
incident management.

Major incident manager

The person who is responsible for the MIT and the implementation of the MIM process.

Mean time to acknowledge (MTTA)

A measurement of how quickly an incident is acknowledged by the service desk.

Mean time to detect (MTTD)

A measurement of how quickly a potential threat to a service or configuration item is
detected.

Mean time between failures (MTBF)

A measurement of how frequently a service or asset fails.

Mean time to repair/resolve/respond/recover (MTTR)

A measurement of how quickly a service is restored after failure.

Normal service operation

A service operation that adheres to the service level agreement (SLA).

Problem

A cause or potential cause of one or more incidents.

RACI matrix

It defines the roles and responsibilities in cross-functional or departmental projects and
processes.

Service desk

The point of communication between service providers and the organization's users.

Service desk manager

The one who oversees day-to-day activities of the service desk and is responsible for its
performance.

Service-level objective (SLO)

It defines the objective of the service providers, and is a means of measuring their
performance.

SLA

An agreement between the service provider and the customer about the expected level of
service and the expected time in which it is delivered.

Urgency

A measure of how quickly an incident needs to be resolved.