Major incident management: An overview
It's Monday morning and things are pretty normal at your service desk. Suddenly, you get an
alert ticket that a critical service is down, and within the next 15 minutes you start getting an
influx of tickets reporting the same issue. It could be that your website is down, your point of
sale software has stopped working, or something even more far-reaching, like the stock
exchange going down or planes being grounded. When your business is severely impacted
by an IT issue causing loss of revenue and/or reputation, you have a major incident on your
hands.
How you react to a major incident makes all the difference in minimizing the impact of the
incident and bringing services back up. As they say, time is money, and in this case, that
couldn't be more true. If your organization has a major incident management (MIM) process
in place, you can swiftly respond to and resolve major incidents. If you don't have such a
process in place, it's time to draw up an emergency response plan, also known as a
major incident response process.
The stakes of a major incident are higher than ever before. According to a study by
Information Technology Intelligence Consulting, 98 percent of organizations lose at least
$100,000 from an hour of downtime. This reinforces the importance of setting up a MIM
process that can effectively and efficiently tackle major incidents.
Every organization aims to eliminate major incidents, but the bottom line is that major
incidents are impossible to prevent completely and the only thing you can do is be prepared
for them.
In this guide, we'll look at how to set up an effective MIM process, common mistakes that
can affect your organization's MIM, and best practices for improving your MIM process.
But first, what makes an incident a major incident?
What is a major incident?
A major incident is a high-impact, urgent issue that usually affects the whole organization or
a major part of it. A major incident almost always results in an organization's services
becoming unavailable, which causes the organization's business to take a hit and ultimately
affects its financial standing. There are two ways a major incident can affect an
organization's services:
By preventing customers from accessing the organization's services. The Cloudflare
outage in July 2019 is an example of customers being affected by a major incident.
This major outage affected almost half the internet and left millions of internet users
unable to access various services.
By disrupting employees' ability to complete their work on time, leading to a
business disruption. IndiGo's outage in November 2019 affected the airline's check-in
process, which led to long delays and affected thousands of passengers.
A well-prepared service desk is equipped to assess major incidents and come up with
solutions or workarounds to reduce and control the impact of a major incident.
The 4 stages of a major incident
Major incidents are considered to have 4 main stages, namely:
Identification
Containment
Resolution
Maintenance
The major incident management process
A major incident management process is a must-have for organizations, as it helps them
minimize the business impact of a major incident. The major incident management process
primarily consists of the following steps:
Stage 1: Identification
Declaring the major incident
The first step is to identify possible major incidents. It is important for organizations to set up
multiple methods of identifying threats. Major incidents can be flagged by technicians when
they come across unusual tickets, or they can be detected by solutions like network
monitoring tools that can automatically flag a network issue and create a ticket to alert the
service desk. Organizations can also set up a dedicated hotline for service desk personnel to
flag suspected major incidents.
Informing stakeholders
Once a major incident has been identified, it needs to be communicated to all key
stakeholders. There are four main groups that need to be informed of major incidents:
Technical team: It is important to inform the technical team immediately so they can
start deciding on a course of action to fix the issue.
Management: Keeping upper management, like the CIO, informed about major
incidents helps with accountability. Organizations should also keep management
informed of all the steps taken to fix major incidents.
Key stakeholders: The department heads and service-level business management
staff also need to be informed of major incidents and receive regular status updates.
Users: Users need to know which services may be unavailable due to a major
incident.
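The four audiences above can be captured in a small routing table so that each group receives a message suited to its needs. The following is a minimal sketch, not a feature of any particular service desk product: the group keys, message templates, update interval, and the `build_notifications` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical message templates, one per stakeholder group named in the text.
TEMPLATES = {
    "technical_team": "[ACTION] {service} is down. Join the incident bridge and begin triage.",
    "management": "[INFO] Major incident declared for {service}. Updates will follow.",
    "key_stakeholders": "[STATUS] {service} is affected by a major incident. Next update in {interval} min.",
    "users": "[NOTICE] {service} is currently unavailable. We are working on a fix.",
}


@dataclass
class Notification:
    group: str
    message: str


def build_notifications(service: str, interval: int = 30) -> list[Notification]:
    """Return one notification per stakeholder group for the affected service."""
    return [
        Notification(group, template.format(service=service, interval=interval))
        for group, template in TEMPLATES.items()
    ]


notifications = build_notifications("Point of sale")
for n in notifications:
    print(f"{n.group}: {n.message}")
```

Keeping the templates in one place makes it easier to send consistent wording to every group each time a status update goes out.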
Stage 2: Containment
Assembling the major incident team
A major incident team, or MIT for short, consists of technicians, service-level management
heads, and other key stakeholders; sometimes highly skilled external personnel are brought
in to tackle a major incident. The MIT works together to find a fix for the major incident and
bring operations back to normal.
Setting up a conference bridge
A conference bridge, more commonly known as a conference call, helps with effective
troubleshooting and centralized communication. It acts as a clear, fast channel of
communication between members of the MIT.
Preparing a designated war room
Having a designated war room allows all members of the MIT to gather and troubleshoot the
incident together. This improves collaboration, helping the MIT come up with a solution faster.
Creating a problem ticket to identify underlying issues
A problem ticket can be created to discover and understand the root cause of the major
incident. This can help prevent similar major incidents in the future by addressing the causes
of the major incident.
Stage 3: Resolution
Implementing the resolution plan as a change
It is good practice to implement the fix for the major incident as a change to ensure that the
resolution is properly documented and implemented. Implementing the resolution as a
change minimizes the risk of a botched resolution disrupting other services.
Stage 4: Maintenance
Performing a post-implementation review
It is important to take stock of the incident over a period of time to make sure it's truly
resolved. If underlying issues are left unresolved, they could lead to another major incident.
Producing clear documentation
Documenting the entire process of resolving the major incident helps the organization
prepare for similar incidents in the future. With proper documentation of past incidents, the
organization can implement the tried and tested solution immediately when faced with
another similar major incident, reducing its impact.
Measuring metrics
Measuring the performance of the service desk helps gauge the effectiveness of the service
desk and the MIM process. Some important metrics to measure are mean time to
acknowledge (MTTA), mean time to resolve (MTTR), total number of major incidents, and
average downtime for major incidents.
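As a rough illustration of how MTTA and MTTR could be derived from ticket timestamps, here is a minimal sketch. The incident records and the `mean_minutes` helper are hypothetical; a real service desk would pull these timestamps from its ticket history.

```python
from datetime import datetime, timedelta

# Hypothetical major incident records: (reported, acknowledged, resolved).
incidents = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 4), datetime(2024, 1, 8, 10, 0)),
    (datetime(2024, 2, 3, 14, 30), datetime(2024, 2, 3, 14, 36), datetime(2024, 2, 3, 16, 30)),
]


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    total = sum(durations, timedelta())
    return total.total_seconds() / 60 / len(durations)


# MTTA: reported -> acknowledged. MTTR: reported -> resolved.
mtta = mean_minutes([ack - reported for reported, ack, _ in incidents])
mttr = mean_minutes([resolved - reported for reported, _, resolved in incidents])

print(f"Major incidents: {len(incidents)}")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

With the sample data, the two acknowledgement gaps (4 and 6 minutes) average to 5 minutes, and the two resolution times (60 and 120 minutes) average to 90 minutes.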
Major IT incident management process flow chart
Major incident management roles and responsibilities
A major incident calls for a special group of personnel to tackle the incident and resolve it.
MIM roles include:
Service desk technicians
Service desk technicians are the first line of defense against major incidents. They analyze
incident tickets and escalate them to the incident manager. Service desk technicians are also
involved in the implementation of resolutions.
Major incident manager
The major incident manager is the owner of the major incident. Their role includes declaring
the incident as a major incident, ensuring that the MIM process is followed, and making sure
the incident is resolved as quickly as possible. They act as the main point of contact for any
information about the major incident, and they manage the MIT.
MIT
An MIT is a specialized team that is responsible for analyzing the major incident and
formulating an action plan to handle the threat. The MIT ideally consists of service desk
technicians, service-level management personnel, technical staff, other relevant
stakeholders, and external consultants if the situation requires it.
Technical staff
An organization's technical staff are the specialized personnel responsible for the upkeep of
infrastructure and operations, including sysadmins, network administrators, and information
security staff. The technical staff help troubleshoot the major incident and are primarily
responsible for implementing the major incident resolution.
Change manager
The change manager is the owner of the change that is created to implement the fix for the
major incident. The change manager takes full ownership of the change ticket and is
accountable for it.
Problem manager
If a problem is created in response to the major incident, the problem manager owns the
problem ticket. The problem manager tries to ascertain the root causes of the incident and
ensure it doesn't occur again, or that the organization is at least prepared for the next time
the incident occurs.
External consultants or third-party vendors
In some cases, the major incident may require highly specialized personnel to help
understand and troubleshoot the incident. The major incident manager identifies the
required personnel and adds them to the MIT to help reduce the impact of the major
incident.
RACI matrix
An RACI matrix defines the responsibilities of various stakeholders in a process. The table
below defines the roles and responsibilities of the major incident stakeholders throughout
the MIM process.
Process/roles                                              Service desk  Major incident  MIT  Technical  Change   Problem  External
                                                           technicians   manager              staff      manager  manager  consultants
Identification
  Declaring the major incident                             C             A               R    C          I        I        I
  Informing stakeholders                                   C             A               R    I          I        I        I
Containment
  Assembling the MIT                                       I             R/A             C    C          I        C        I
  Setting up a conference bridge                           I             A               R    C          I        C        I
  Preparing a designated war room                          I             A               R    I          I        C        I
  Creating a problem ticket to identify underlying issues  I             A               R    C          I        I        I
Resolution
  Implementing the resolution plan as a change             I             I               I    R          A        C        C
Maintenance
  Performing post-implementation review                    I             C               I    R          A        C        I
  Producing clear documentation                            C             A               R    C          C        C        C
  Measuring metrics                                        I             A               R    I          I        I        C
* R - Responsible, A - Accountable, C - Consulted, I - Informed
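A RACI matrix is easier to keep consistent when it is stored as data and checked automatically. The sketch below encodes the matrix above and applies one common sanity check: every activity should have exactly one Accountable role. The `ROLES` and `RACI` names and the `accountable_roles` helper are illustrative assumptions, and the seventh column is assumed to be the external consultants, following the order of the role list.

```python
# Column order follows the role list in the text; the last column is an assumption.
ROLES = [
    "Service desk technicians", "Major incident manager", "MIT",
    "Technical staff", "Change manager", "Problem manager",
    "External consultants",
]

RACI = {
    "Declaring the major incident": ["C", "A", "R", "C", "I", "I", "I"],
    "Informing stakeholders": ["C", "A", "R", "I", "I", "I", "I"],
    "Assembling the MIT": ["I", "R/A", "C", "C", "I", "C", "I"],
    "Setting up a conference bridge": ["I", "A", "R", "C", "I", "C", "I"],
    "Preparing a designated war room": ["I", "A", "R", "I", "I", "C", "I"],
    "Creating a problem ticket to identify underlying issues": ["I", "A", "R", "C", "I", "I", "I"],
    "Implementing the resolution plan as a change": ["I", "I", "I", "R", "A", "C", "C"],
    "Performing post-implementation review": ["I", "C", "I", "R", "A", "C", "I"],
    "Producing clear documentation": ["C", "A", "R", "C", "C", "C", "C"],
    "Measuring metrics": ["I", "A", "R", "I", "I", "I", "C"],
}


def accountable_roles(activity: str) -> list[str]:
    """Return the roles marked Accountable for an activity ("R/A" counts as A)."""
    return [
        role for role, code in zip(ROLES, RACI[activity])
        if "A" in code.split("/")
    ]


# Collect any activity that does not have exactly one Accountable role.
problems = {a: accountable_roles(a) for a in RACI if len(accountable_roles(a)) != 1}
print(problems or "Every activity has exactly one Accountable role.")
```

The same structure can answer practical questions during an incident, such as who must be consulted before the resolution is rolled out as a change.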
5 Common mistakes in major incident management
Here are 5 common mistakes that can hinder your MIM process:
1. Manual communication and escalation
By far the biggest challenge to MIM is communication. In the event of a major incident,
various stakeholders need to be informed of the status of the incident, its severity, and what
troubleshooting has been done to fix it. Communicating all this manually is an arduous task
and can lead to inconsistent communication, which only makes matters worse. Automating
the process ensures that key stakeholders are notified throughout the entire ticket life cycle
and lets the major incident manager focus their entire attention on fixing the issue.
2. Ineffective channels for reporting major incidents
Every service desk receives tens or even hundreds of tickets a day, ranging from laptop
issues to service requests; among this mountain of tickets, there could be a few potential
major incidents. Not setting up a separate channel to report major incidents delays the
identification of major incidents.
3. Duplication of efforts
Failure to delegate tasks in an organized manner can cause duplication of efforts within the
MIT. It is important to assign tasks and keep the MIT informed of what each member is
tasked with.
4. Poor documentation
Lack of proper documentation will force the MIT to reinvent the wheel every time a similar
major incident occurs, leading to delays in resolving major incidents and causing
unnecessary downtime.
5. Failure to analyze the root cause
Similar to incident management, MIM can be myopic in scope, as its primary focus is to fix
the issue and get services up and running within the shortest possible time. If not combined
with problem management to identify underlying issues, the underlying cause of a major
incident will continue to make the organization vulnerable to major incidents.
5 Major incident management best practices
Here are the best ways to approach the MIM process:
1. Enable multiple channels for reporting major incidents
When it comes to handling major incidents, time is of the essence. It is vital for organizations
to identify and classify major incidents as soon as they are detected. Offering users multiple
ways to report incidents will make the entire process faster and more accessible. You can
enable ticket creation through email or a web portal, or even set up a dedicated hotline to
report suspected major incidents. Setting up network monitoring software to detect
anomalies can help you proactively deal with major incidents.
2. Automate service desk processes
Speed and efficiency play a vital role in controlling the impact of a major incident, and
automating various service desk processes helps achieve this by freeing up your technicians
from repetitive tasks such as notifying stakeholders. Automating the notification system and
setting up major incident workflows are good ways of automating service desk processes to
improve resolution time and bring structure to your MIM process.
3. Strive for prompt, relevant communication
It is important to keep your organization's management and important stakeholders
informed of every major incident. Keeping management in the loop will help with getting
necessary approvals and permissions required to fix the major incident. Prompt
communication ensures that all the major incident personnel are on the same page and
allows for smooth, effective collaboration; it also keeps end users informed of any possible
downtime so they can prepare for it.
4. Create clear documentation
Clear documentation helps the major incident manager record all the work done to fix the
major incident, its impact, the affected services, and other key information about the major
incident. This documentation is important to show management the benefit of having a
MIM process, including its ROI. Clear documentation will also help with any similar major
incident in the future.
5. Utilize deep integrations with ITOM software
Strong integrations with ITOM software enable the IT department to proactively handle
major incidents. Reactive major incident identification relies on an influx of tickets to raise a
red flag that a major incident is in progress. On the other hand, a proactive MIM process
that utilizes ITOM integrations has systems in place to monitor networks and services, and
can automatically flag anomalies that could be potential major incidents.
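Even the reactive signal, an influx of tickets about the same service, can be detected automatically rather than left for a technician to notice. The following is a minimal sliding-window sketch; the 15-minute window, the threshold of five tickets, and the `find_ticket_spikes` helper are assumptions chosen for illustration, not values taken from any ITOM product.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # assumed look-back window
THRESHOLD = 5                   # assumed ticket count that raises a flag


def find_ticket_spikes(tickets):
    """Return services whose ticket count reaches THRESHOLD within WINDOW.

    `tickets` is an iterable of (timestamp, service) pairs sorted by time.
    """
    recent = defaultdict(deque)  # service -> timestamps inside the window
    flagged = []
    for ts, service in tickets:
        window = recent[service]
        window.append(ts)
        # Drop timestamps that have fallen out of the look-back window.
        while ts - window[0] > WINDOW:
            window.popleft()
        if len(window) >= THRESHOLD and service not in flagged:
            flagged.append(service)
    return flagged


start = datetime(2024, 1, 8, 9, 0)
# Six website tickets two minutes apart, plus one unrelated printer ticket.
tickets = [(start + timedelta(minutes=2 * i), "website") for i in range(6)]
tickets.append((start + timedelta(minutes=20), "printer"))
print(find_ticket_spikes(tickets))
```

A flagged service is only a candidate major incident; the major incident manager still decides whether to declare one.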
Major incident management metrics and KPIs
When it comes to MIM, below are some important metrics and KPIs to track.
KPI                                               Formula
Mean time to resolution (MTTR)                    The average time from when a major incident is reported to when it is resolved.
Mean time to acknowledge (MTTA)                   The average time to respond to a major incident.
Mean time between failures (MTBF)                 The average time between failures. It is calculated by dividing the total uptime by the total number of failures.
Mean time to detect (MTTD)                        The average time taken to detect major incidents or anomalies.
Percent increase or decrease of major incidents   The percent increase or decrease of major incidents in subsequent months relative to the first month.
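Two of the formulas above are simple enough to write out directly. This is a minimal sketch with hypothetical function names and sample figures: MTBF as total uptime divided by the number of failures, and the month-over-month change expressed relative to the first month.

```python
def mtbf_hours(total_uptime_hours: float, failures: int) -> float:
    """Mean time between failures: total uptime divided by the number of failures."""
    if failures == 0:
        raise ValueError("MTBF is undefined when there are no failures")
    return total_uptime_hours / failures


def percent_change(first_month: int, later_month: int) -> float:
    """Percent increase (positive) or decrease (negative) relative to the first month."""
    if first_month == 0:
        raise ValueError("percent change is undefined when the baseline is zero")
    return (later_month - first_month) / first_month * 100


# Sample figures: 720 hours of uptime with 3 failures; 4 major incidents
# in the first month and 3 in a later month.
print(mtbf_hours(720, 3))    # 240.0
print(percent_change(4, 3))  # -25.0
```

A negative result from `percent_change` indicates fewer major incidents than in the baseline month.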
Major incident scenario
It is important to remember that not all high-priority incidents are major incidents. Since the
MIM process involves a sizable commitment of resources, such as assembling a dedicated MIT,
it is important to carefully classify major incidents.
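One way to make this classification repeatable is to encode it as a rule on impact and urgency. The sketch below treats an incident as major only when both are high, in line with the glossary definition. The three-level scale, the "high priority" rule, and the `classify` helper are assumptions for illustration; each organization defines its own thresholds.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify(impact: Level, urgency: Level) -> str:
    """Classify an incident from its impact and urgency (assumed rules)."""
    if impact is Level.HIGH and urgency is Level.HIGH:
        return "major incident"
    if Level.HIGH in (impact, urgency):
        # High priority, but not enough to trigger the MIM process.
        return "high priority"
    return "standard"


print(classify(Level.HIGH, Level.HIGH))    # major incident
print(classify(Level.MEDIUM, Level.HIGH))  # high priority
print(classify(Level.LOW, Level.MEDIUM))   # standard
```

Keeping the rule explicit helps avoid convening an MIT for incidents that are urgent but limited in impact.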
Source: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
The 2019 Cloudflare outage is a very good example of what defines a major incident. In this
case, a standard operating procedure of updating a managed rule for the web application
firewall (WAF) spiked the usage of CPUs dedicated to serving HTTP/HTTPS traffic to nearly
100 percent across the servers in Cloudflare's network. The outage that followed resulted in
a reduction of 80 percent of Cloudflare's traffic, and affected millions of internet users
around the world.
Impact: Large
The outage resulted in Cloudflare customers (and their customers) seeing a 502 error page
when visiting any Cloudflare domain. The 502 errors were generated by the front-end
Cloudflare web servers that still had CPU cores available but were unable to reach the
processes that serve HTTP/HTTPS traffic. It's estimated that at least half of the entire
internet was inaccessible for the twenty-seven minutes of downtime.
Urgency: High
All Cloudflare websites were inaccessible, causing service disruptions for thousands of
organizations and millions of users. The outage affected the internal operations of
Cloudflare, too, preventing the Cloudflare employees from accessing various services like the
company's change management tool and internal control panel. The outage had to be dealt
with to resume normal service operations.
Timeline of events from detection to resolution:
The WAF managed rule was implemented at 13:42; three minutes later, Cloudflare's network
operation tools started flagging the drop in traffic, many other end-to-end tests of Cloudflare
services began failing, end users noticed various 502 errors, and Cloudflare received many
reports of CPU exhaustion from its points of presence in cities worldwide.
The site reliability engineering team, London engineering team, and other relevant teams
were brought together to troubleshoot and come up with a fix. At 14:00, the WAF was
identified as the cause of the incident. And at 14:07, a global WAF kill was implemented to
bring traffic levels back to normal.
By 14:52, Cloudflare was 100 percent satisfied that it understood the cause of the outage
and had a fix in place, so the WAF was re-enabled globally.
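The timestamps in this timeline can be turned into elapsed times, which is how metrics such as time to detect are measured in practice. This is a minimal sketch using only the times stated above; the event labels and the `minutes_since_start` helper are illustrative. The results are gaps between the listed timestamps, not Cloudflare's own downtime figure.

```python
from datetime import datetime

# Times as stated in the timeline above; all on the same day.
TIMELINE = [
    ("WAF managed rule deployed", "13:42"),
    ("Monitoring flags traffic drop", "13:45"),  # "three minutes later"
    ("WAF identified as the cause", "14:00"),
    ("Global WAF kill implemented", "14:07"),
    ("WAF re-enabled globally", "14:52"),
]


def minutes_since_start(timeline):
    """Map each event to the minutes elapsed since the first event."""
    parsed = [(name, datetime.strptime(t, "%H:%M")) for name, t in timeline]
    start = parsed[0][1]
    return {name: int((ts - start).total_seconds() // 60) for name, ts in parsed}


elapsed = minutes_since_start(TIMELINE)
for event, minutes in elapsed.items():
    print(f"+{minutes:>3} min  {event}")
```

On these figures, detection took 3 minutes, identifying the cause took 18 minutes, and the kill switch was applied 25 minutes after the rule was deployed.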
Glossary
Change
The addition, modification, or removal of anything that can have a direct or indirect effect on
services.
Change management
The process of taking changes to completion with minimum disruptions and collisions.
Escalation
The act of transferring ownership of a ticket based on a functional or hierarchical need.
Event
An occurrence that has significance for the management of a service or asset.
Failure
An occurrence where a service or asset does not function according to the agreed SLA.
Hierarchical escalation
The act of transferring ownership vertically to a higher tier service desk technician or
relevant authority.
Impact
A measure of the severity of an incident.
Incident
An unplanned interruption to an IT service, or a reduction in the quality of an IT service.
Failure of a configuration item, even if it has not yet affected a service, is also an incident
(e.g. failure of one disk from a mirror set).
Incident management
The process of managing the life cycle of all incidents to restore normal service operations as
quickly as possible and minimize business impact.
Incident prioritization
Assigning priorities to incidents and defining what constitutes a major incident.
Major incident
An incident that has a high impact and high urgency, requiring a separate process from
incident management.
Major incident manager
The person who is responsible for the MIT and the implementation of the MIM process.
Mean time to acknowledge (MTTA)
A measurement of how quickly an incident is acknowledged by the service desk.
Mean time to detect (MTTD)
A measurement of how quickly a potential threat to a service or configuration item is
detected.
Mean time between failures (MTBF)
A measurement of how frequently a service or asset fails.
Mean time to repair/resolve/respond/recover (MTTR)
A measurement of how quickly a service is restored after failure.
Normal service operation
A service operation that adheres to the service level agreement (SLA).
Problem
A cause or potential cause of one or more incidents.
RACI matrix
It defines the roles and responsibilities in cross-functional or departmental projects and
processes.
Service desk
The point of communication between service providers and the organization's users.
Service desk manager
The one who oversees day-to-day activities of the service desk and is responsible for its
performance.
Service-level objective (SLO)
It defines the objective of the service providers, and is a means of measuring their
performance.
SLA
An agreement between the service provider and the customer about the expected level of
service and the expected time in which it is delivered.
Urgency
A measure of how quickly an incident needs to be resolved.