Major Incident Management

The document provides an overview of major incident management (MIM), detailing its importance in minimizing business impact during critical IT issues. It outlines the stages of a major incident, the MIM process, common mistakes, best practices, and key performance metrics. The document emphasizes the necessity of having a structured MIM process to effectively handle major incidents and reduce downtime and financial losses.

Uploaded by

lennss857


Major incident management: An overview

It's Monday morning and things are pretty normal at your service desk. Suddenly, you get an
alert ticket that a critical service is down, and within the next 15 minutes you start getting an
influx of tickets reporting the same issue. It could be that your website is down, your point of
sale software has stopped working, or something even more far-reaching, like the stock
exchange going down or planes being grounded. When your business is severely impacted
by an IT issue causing loss of revenue and/or reputation, you have a major incident on your
hands.

How you react to a major incident makes all the difference in minimizing the impact of the
incident and bringing services back up. As they say, time is money, and in this case, that
couldn't be more true. If your organization has a major incident management (MIM) process
in place, you can swiftly respond to and resolve major incidents. If you don't have such a
process in place, it's time to draw up an emergency response plan, also known as a
major incident response process.

The stakes of a major incident are higher than ever before. According to a study by
Information Technology Intelligence Consulting, 98 percent of organizations lose at least
$100,000 from an hour of downtime. This reinforces the importance of setting up a MIM
process that can tackle major incidents effectively and efficiently.

Every organization aims to eliminate major incidents, but the bottom line is that major
incidents are impossible to prevent completely and the only thing you can do is be prepared
for them.

In this guide, we'll look at how to set up an effective MIM process, common mistakes that
can affect your organization's MIM, and best practices for improving your MIM process.

But first, what makes an incident a major incident?

What is a major incident?

A major incident is a high-impact, urgent issue that usually affects the whole organization or
a major part of it. A major incident almost always results in an organization's services
becoming unavailable, which causes the organization's business to take a hit and ultimately
affects its financial standing. There are two ways a major incident can affect an
organization's services:

• By preventing customers from accessing the organization's services. The Cloudflare
  outage in July 2019 is an example of customers being affected by a major incident.
  This major outage affected almost half the internet and left millions of internet users
  unable to access various services.

• By disrupting employees' ability to complete their work on time, leading to a
  business disruption. IndiGo's outage in November 2019 affected the airline's check-in
  process, which led to long delays and affected thousands of passengers.

A well-prepared service desk is equipped to assess major incidents and come up with
solutions or workarounds to reduce and control the impact of a major incident.

The 4 stages of a major incident

Major incidents are considered to have 4 main stages, namely:

• Identification
• Containment
• Resolution
• Maintenance

The major incident management process

A major incident management process is a must-have for organizations, as it helps them
minimize the business impact of a major incident. The major incident management process
primarily consists of the following steps:

Stage 1: Identification

Declaring the major incident:

The first step is to identify possible major incidents. It is important for organizations to set up
multiple methods of identifying threats. Major incidents can be flagged by technicians when
they come across unusual tickets, or they can be detected by solutions like network
monitoring tools that can automatically flag a network issue and create a ticket to alert the
service desk. Organizations can also set up a dedicated hotline for service desk personnel to
flag suspected major incidents.

Informing stakeholders:

Once a major incident has been identified, it needs to be communicated to all key
stakeholders. There are four main groups that need to be informed of major incidents:

• Technical team: It is important to inform the technical team immediately so they can
  start deciding on a course of action to fix the issue.

• Management: Keeping upper management, like the CIO, informed about major
  incidents helps with accountability. Organizations should also keep management
  informed of all the steps taken to fix major incidents.

• Key stakeholders: The department heads and service-level business management
  staff also need to be informed of major incidents and receive regular status updates.

• Users: Users need to know which services may be unavailable due to a major
  incident.
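The four audiences above each need a different level of detail, which is one reason notifications are worth templating. The following is a minimal sketch only: the `MajorIncident` fields, the audience keys, and the message wording are all hypothetical and not taken from any particular service desk product.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical audience keys, one per stakeholder group listed above.
AUDIENCES = ("technical_team", "management", "key_stakeholders", "users")


@dataclass
class MajorIncident:
    ticket_id: str                 # hypothetical ticket identifier
    summary: str
    affected_services: List[str]
    actions_taken: List[str]


def build_notifications(incident: MajorIncident) -> Dict[str, str]:
    """Return one message per audience, tailored to what that group needs."""
    services = ", ".join(incident.affected_services)
    actions = "; ".join(incident.actions_taken) or "none yet"
    prefix = f"[{incident.ticket_id}] {incident.summary}."
    return {
        # Technicians need to start working on a fix immediately.
        "technical_team": f"{prefix} Affected: {services}. Please begin troubleshooting.",
        # Management is kept informed of every step taken.
        "management": f"{prefix} Affected: {services}. Steps taken so far: {actions}.",
        # Department heads receive regular status updates.
        "key_stakeholders": f"{prefix} Status update: {actions}.",
        # Users only need to know what may be unavailable.
        "users": f"The following services may be unavailable: {services}.",
    }


if __name__ == "__main__":
    incident = MajorIncident("INC-1", "Website down", ["website"], [])
    for audience, message in build_notifications(incident).items():
        print(f"{audience}: {message}")
```

Keeping the templates in one function makes it easier to send consistent messages every time the incident status changes.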

Stage 2: Containment

Assembling the major incident team

A major incident team, or MIT for short, consists of technicians, service-level management
heads, and other key stakeholders; sometimes highly skilled external personnel are brought
in to tackle a major incident. The MIT works together to find a fix for the major incident and
bring operations back to normal.

Setting up a conference bridge

A conference bridge, more commonly known as a conference call, helps with effective
troubleshooting and centralized communication. It acts as a clear, fast channel of
communication between members of the MIT.

Preparing a designated war room

Having a designated war room allows all members of the MIT to gather and troubleshoot the
incident. This increases collaboration efforts, helping the MIT come up with a solution faster.

Creating a problem ticket to identify underlying issues

A problem ticket can be created to discover and understand the root cause of the major
incident. This can help prevent similar major incidents in the future by addressing the causes
of the major incident.

Stage 3: Resolution

Implementing the resolution plan as a change

It is good practice to implement the fix for the major incident as a change to ensure that the
resolution is properly documented and implemented. Implementing the resolution as a
change minimizes the risk of a botched resolution disrupting other services.

Stage 4: Maintenance

Performing a post-implementation review

It is important to take stock of the incident over a period of time to make sure it's truly
resolved. If underlying issues are left unresolved, they could lead to another major incident.

Producing clear documentation

Documenting the entire process of resolving the major incident helps the organization
prepare for similar incidents in the future. With proper documentation of past incidents, the
organization can implement the tried and tested solution immediately when faced with
another similar major incident, reducing its impact.

Measuring metrics

Measuring the performance of the service desk helps gauge the effectiveness of the service
desk and the MIM process. Some important metrics to measure are mean time to
acknowledge (MTTA), mean time to resolve (MTTR), total number of major incidents, and
average downtime for major incidents.


Major IT incident management process flow chart

Major incident management roles and responsibilities

A major incident calls for a special group of personnel to tackle the incident and resolve it.
MIM roles include:

Service desk technicians

Service desk technicians are the first line of defense against major incidents. They analyze
incident tickets and escalate them to the incident manager. Service desk technicians are also
involved in the implementation of resolutions.

Major incident manager

The major incident manager is the owner of the major incident. Their role includes declaring
the incident as a major incident and ensuring that the MIM process is followed and the
incident is resolved as early as possible. They act as the main point of contact for any
information about the major incident, and manage the MIT.

MIT

An MIT is a specialized team that is responsible for analyzing the major incident and
formulating an action plan to handle the threat. The MIT ideally consists of service desk
technicians, service-level management personnel, technical staff, other relevant
stakeholders, and external consultants if the situation requires it.

Technical staff

The technical staff are the specialized personnel responsible for the upkeep of
infrastructure and operations, including sysadmins, network administrators, and information
security staff. They help troubleshoot the major incident and are primarily responsible for
implementing the major incident resolution.

Change manager

The change manager is the owner of the change that is created to implement the fix for the
major incident. The change manager takes full ownership of the change ticket and is
accountable for it.

Problem manager

If a problem is created in response to the major incident, the problem manager owns the
problem ticket. The problem manager tries to ascertain the root causes of the incident and
ensure it doesn't occur again, or that the organization is at least prepared for the next time
the incident occurs.

External consultants or third-party vendors

In some cases, the major incident may require highly specialized personnel to help
understand and troubleshoot the incident. The major incident manager identifies the
required personnel and adds them to the MIT to help reduce the impact of the major
incident.

RACI matrix

An RACI matrix defines the responsibilities of various stakeholders in a process. The table
below defines the roles and responsibilities of the major incident stakeholders throughout
the MIM process.

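| Process/roles | Service desk technicians | Major incident manager | MIT | Technical staff | Change manager | Problem manager | External consultants |
|---|---|---|---|---|---|---|---|
| Identification | | | | | | | |
| Declaring the major incident | C | A | R | C | I | I | I |
| Informing stakeholders | C | A | R | I | I | I | I |
| Containment | | | | | | | |
| Assembling the MIT | I | R/A | C | C | I | C | I |
| Setting up a conference bridge | I | A | R | C | I | C | I |
| Preparing a designated war room | I | A | R | I | I | C | I |
| Creating a problem ticket to identify underlying issues | I | A | R | C | I | I | I |
| Resolution | | | | | | | |
| Implementing the resolution plan as a change | I | I | I | R | A | C | C |
| Maintenance | | | | | | | |
| Performing post-implementation review | I | C | I | R | A | C | I |
| Producing clear documentation | C | A | R | C | C | C | C |
| Measuring metrics | I | A | R | I | I | I | C |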

* R - Responsible, A - Accountable, C - Consulted, I - Informed
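A RACI matrix like the one above can also be kept as data, so that it can be queried and sanity-checked. The sketch below transcribes the rows of the table and checks a common RACI convention, namely that each activity has exactly one Accountable role. The function names are hypothetical, and the last role name is assumed to be "External consultants", since the column header is cut off in the source.

```python
from typing import Dict, List

ROLES = [
    "Service desk technicians", "Major incident manager", "MIT",
    "Technical staff", "Change manager", "Problem manager",
    "External consultants",  # assumption: header is truncated in the source
]

# Rows transcribed from the table above; one code per role, in ROLES order.
RACI: Dict[str, str] = {
    "Declaring the major incident": "C A R C I I I",
    "Informing stakeholders": "C A R I I I I",
    "Assembling the MIT": "I R/A C C I C I",
    "Setting up a conference bridge": "I A R C I C I",
    "Preparing a designated war room": "I A R I I C I",
    "Creating a problem ticket to identify underlying issues": "I A R C I I I",
    "Implementing the resolution plan as a change": "I I I R A C C",
    "Performing post-implementation review": "I C I R A C I",
    "Producing clear documentation": "C A R C C C C",
    "Measuring metrics": "I A R I I I C",
}


def roles_with(activity: str, letter: str) -> List[str]:
    """Return the roles that hold the given RACI letter for an activity."""
    codes = RACI[activity].split()
    return [role for role, code in zip(ROLES, codes) if letter in code.split("/")]


def validate(raci: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the matrix looks consistent."""
    problems = []
    for activity, row in raci.items():
        codes = row.split()
        if len(codes) != len(ROLES):
            problems.append(f"{activity}: expected {len(ROLES)} codes, got {len(codes)}")
            continue
        accountable = sum("A" in code.split("/") for code in codes)
        if accountable != 1:
            problems.append(f"{activity}: {accountable} accountable roles")
    return problems


if __name__ == "__main__":
    print(roles_with("Assembling the MIT", "A"))
    print(validate(RACI))
```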

5 Common mistakes in major incident management

Here are 5 common mistakes that can hinder your MIM process:

1. Manual communication and escalation

By far the biggest challenge to MIM is communication. In the event of a major incident,
various stakeholders need to be informed of the status of the incident, its severity, and what
troubleshooting has been done to fix it. Communicating all this manually is an arduous task,
and can lead to inconsistent communication, which only makes matters worse. By
automating the process, key stakeholders are notified throughout the entire ticket life cycle,
and the major incident manager can focus their entire attention on fixing the issue.

2. Ineffective channels for reporting major incidents

Every service desk receives tens or even hundreds of tickets a day, ranging from laptop
issues to service requests; among this mountain of tickets, there could be a few potential
major incidents. Not setting up a separate channel to report major incidents delays the
identification of major incidents.

3. Duplication of efforts

Failure to delegate tasks in an organized manner can cause duplication of efforts within the
MIT. It is important to assign tasks and keep the MIT informed of what each member is
tasked with.

4. Poor documentation

Lack of proper documentation will force the MIT to reinvent the wheel every time a similar
major incident occurs, leading to delays in resolving major incidents and causing
unnecessary downtime.

5. Failure to analyze the root cause

Similar to incident management, MIM can be myopic in scope, as its primary focus is to fix
the issue and get services up and running within the shortest possible time. If MIM is not
combined with problem management to identify underlying issues, the root cause of a major
incident will continue to leave the organization vulnerable to further major incidents.

5 Major incident management best practices

Here are the best ways to approach the MIM process:

1. Enable multiple channels for reporting major incidents

When it comes to handling major incidents, time is of the essence. It is vital for organizations
to identify and classify major incidents as soon as they are detected. Offering users multiple
ways to report incidents will make the entire process faster and more accessible. You can
enable ticket creation through email or a web portal, or even set up a dedicated hotline to
report suspected major incidents. Setting up network monitoring software to detect
anomalies can help you proactively deal with major incidents.

2. Automate service desk processes

Speed and efficiency play a vital role in controlling the impact of a major incident, and
automating various service desk processes helps achieve this by freeing up your technicians
from repetitive tasks such as notifying stakeholders. Automating the notification system and
setting up major incident workflows are good ways of automating service desk processes to
improve resolution time and bring structure to your MIM process.

3. Strive for prompt, relevant communication

It is important to keep your organization's management and important stakeholders
informed of every major incident. Keeping management in the loop will help with getting
necessary approvals and permissions required to fix the major incident. Prompt
communication ensures that all the major incident personnel are on the same page and
allows for smooth, effective collaboration; it also keeps end users informed of any possible
downtime so they can prepare for it.

4. Create clear documentation

Clear documentation helps the major incident manager record all the work done to fix the
major incident, its impact, the affected services, and other key information about the major
incident. This documentation is important to show management the benefit of having a
MIM process, including its ROI. Clear documentation will also help with any similar major
incident in the future.

5. Utilize deep integrations with ITOM software

Strong integrations with ITOM software enable the IT department to proactively handle
major incidents. Reactive major incident identification relies on an influx of tickets to raise a
red flag that a major incident is in progress. On the other hand, a proactive MIM process
that utilizes ITOM integrations has systems in place to monitor networks and services, and
can automatically flag anomalies that could be potential major incidents.
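The reactive signal described above, an influx of tickets within a short period, can itself be checked automatically. Below is a minimal sliding-window sketch. The 15-minute window echoes the scenario at the start of this guide; the threshold of five tickets and the function name are assumptions for illustration, not recommended values.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable, Optional


def detect_influx(
    timestamps: Iterable[datetime],
    window: timedelta = timedelta(minutes=15),
    threshold: int = 5,  # hypothetical value; tune to your ticket volume
) -> Optional[datetime]:
    """Return the time at which `threshold` tickets fall within `window`.

    Returns None if ticket volume never reaches the threshold.
    """
    recent: deque = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop tickets that are older than the window relative to this one.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return ts
    return None


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    tickets = [start + timedelta(minutes=m) for m in (0, 2, 4, 6, 8)]
    print(detect_influx(tickets))
```

In practice the timestamps would come from tickets reporting the same service or category, so that unrelated requests do not trigger a false alarm.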


Major incident management metrics and KPIs

When it comes to MIM, below are some important metrics and KPIs to track.

| KPI | Formula |
|---|---|
| Mean time to resolution (MTTR) | The average time from when a major incident is reported to when it is resolved. |
| Mean time to acknowledge (MTTA) | The average time to respond to a major incident. |
| Mean time between failures (MTBF) | The average time between failures. It is calculated by dividing the total uptime by the total number of failures. |
| Mean time to detect (MTTD) | The average time taken to detect major incidents or anomalies. |
| Percent increase or decrease of major incidents | The percent increase of problems in subsequent months relative to the first month. |
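The formulas above translate directly into a few lines of code. The following sketch assumes each major incident record carries three timestamps (reported, acknowledged, resolved); the `Incident` type and function names are hypothetical, and real service desk exports will use different field names.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Incident(NamedTuple):
    reported: datetime
    acknowledged: datetime
    resolved: datetime


def mean(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean time to acknowledge: reported -> acknowledged."""
    return mean([i.acknowledged - i.reported for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to resolution: reported -> resolved."""
    return mean([i.resolved - i.reported for i in incidents])


def mtbf(total_uptime: timedelta, failures: int) -> timedelta:
    """Mean time between failures: total uptime / number of failures."""
    return total_uptime / failures


def percent_change(first_month: int, later_month: int) -> float:
    """Percent increase (positive) or decrease (negative) relative to the first month."""
    return (later_month - first_month) / first_month * 100


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 9, 0)
    incidents = [
        Incident(t, t + timedelta(minutes=5), t + timedelta(minutes=65)),
        Incident(t, t + timedelta(minutes=15), t + timedelta(minutes=35)),
    ]
    print("MTTA:", mtta(incidents))
    print("MTTR:", mttr(incidents))
    print("MTBF:", mtbf(timedelta(hours=720), 3))
    print("Change:", percent_change(4, 3), "%")
```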

Major incident scenario

It is important to remember that not all high-priority incidents are major incidents. Since the
MIM process involves a sizable commitment of resources like implementing a separate MIT,
it is important to carefully classify major incidents.
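One way to make this distinction explicit is to compute priority and major-incident status separately. The sketch below follows the glossary definition of a major incident (high impact and high urgency); the three-level scale, the scoring, and the priority thresholds are assumptions for illustration only.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Hypothetical priority score: impact multiplied by urgency."""
    score = impact * urgency
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


def is_major_incident(impact: Level, urgency: Level) -> bool:
    """A major incident requires BOTH high impact and high urgency."""
    return impact == Level.HIGH and urgency == Level.HIGH


if __name__ == "__main__":
    # High priority, but not a major incident: impact is high, urgency is medium.
    print(priority(Level.HIGH, Level.MEDIUM), is_major_incident(Level.HIGH, Level.MEDIUM))
    # High priority and a major incident.
    print(priority(Level.HIGH, Level.HIGH), is_major_incident(Level.HIGH, Level.HIGH))
```

Keeping the two checks apart means a high-priority ticket goes through normal incident management unless it also meets the stricter major incident criteria.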

Source: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/

The 2019 Cloudflare outage is a very good example of what defines a major incident. In this
case, a standard operating procedure of updating a managed rule for the web application
firewall (WAF) spiked the usage of CPUs dedicated to serving HTTP/HTTPS traffic to nearly
100 percent across the servers in Cloudflare's network. The outage that followed resulted in
a reduction of 80 percent of Cloudflare's traffic, and affected millions of internet users
around the world.

Impact: Large

The outage resulted in Cloudflare customers (and their customers) seeing a 502 error page
when visiting any Cloudflare domain. The 502 errors were generated by the front-end
Cloudflare web servers that still had CPU cores available but were unable to reach the
processes that serve HTTP/HTTPS traffic. It's estimated that at least half of the entire
internet was inaccessible for the twenty-seven minutes of downtime.

Urgency: High

All Cloudflare websites were inaccessible, causing service disruptions for thousands of
organizations and millions of users. The outage affected the internal operations of
Cloudflare, too, preventing the Cloudflare employees from accessing various services like the
company's change management tool and internal control panel. The outage had to be dealt
with to resume normal service operations.

Timeline of events from detection to resolution:

The WAF managed rule was implemented at 13:42; three minutes later, Cloudflare's network
operation tools started flagging the drop in traffic, many other end-to-end tests of Cloudflare
services began failing, end users noticed various 502 errors, and Cloudflare received many
reports of CPU exhaustion from its points of presence in cities worldwide.

The site reliability engineering team, London engineering team, and other relevant teams
were brought together to troubleshoot and come up with a fix. At 14:00, the WAF was
identified as the cause of the incident. And at 14:07, a global WAF kill was implemented to
bring traffic levels back to normal.

By 14:52, Cloudflare was 100-percent satisfied that it understood the cause of the outage
and had a fix in place, so the WAF was re-enabled globally.
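As a worked example, the elapsed time at each milestone can be derived from the timestamps in the timeline above. This sketch uses only the times stated in this section (13:45 is inferred from "three minutes later"); the event labels are paraphrased.

```python
from datetime import datetime

FMT = "%H:%M"
rule_deployed = datetime.strptime("13:42", FMT)

# Milestones as given in the timeline above.
events = {
    "first alerts": "13:45",
    "WAF identified as cause": "14:00",
    "global WAF kill": "14:07",
    "WAF re-enabled": "14:52",
}

# Minutes elapsed since the WAF managed rule was deployed.
elapsed = {
    name: int((datetime.strptime(time, FMT) - rule_deployed).total_seconds() // 60)
    for name, time in events.items()
}

for name, minutes in elapsed.items():
    print(f"{name}: +{minutes} min")
```

The first value corresponds to time to detect for this incident, and the third to the time taken to contain it.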


Glossary

Change

The addition, modification, or removal of anything that can have a direct or indirect effect on
services.

Change management

The process of taking changes to completion with minimum disruptions and collisions.

Escalation

The act of transferring ownership of a ticket based on a functional or hierarchical need.

Event

An occurrence that has significance for the management of a service or asset.

Failure

An occurrence where a service or asset does not function according to the agreed SLA.

Hierarchical escalation

The act of transferring ownership vertically to a higher tier service desk technician or
relevant authority.

Impact

A measure of the severity of an incident.

Incident

An unplanned interruption to an IT service, or a reduction in the quality of an IT service.

Failure of a configuration item, even if it has not yet affected a service, is also an incident
(e.g. failure of one disk from a mirror set).

Incident management

The process of managing the life cycle of all incidents to restore normal service operations as
quickly as possible and minimize business impact.

Incident prioritization

Assigning priorities to incidents and defining what constitutes a major incident.

Major incident

An incident that has a high impact and high urgency, requiring a separate process from
incident management.

Major incident manager

The person who is responsible for the MIT and the implementation of the MIM process.

Mean time to acknowledge (MTTA)

A measurement of how quickly an incident is acknowledged by the service desk.

Mean time to detect (MTTD)

A measurement of how quickly a potential threat to a service or configuration item is
detected.

Mean time between failures (MTBF)

A measurement of how frequently a service or asset fails.

Mean time to repair/resolve/respond/recover (MTTR)

A measurement of how quickly a service is restored after failure.

Normal service operation

A service operation that adheres to the service level agreement (SLA).

Problem

A cause or potential cause of one or more incidents.

RACI matrix

It defines the roles and responsibilities in cross-functional or departmental projects and
processes.

Service desk

The point of communication between service providers and the organization's users.

Service desk manager

The one who oversees day-to-day activities of the service desk and is responsible for its
performance.

Service-level objective (SLO)

It defines the objective of the service providers, and is a means of measuring their
performance.

SLA

An agreement between the service provider and the customer about the expected level of
service and the expected time in which it is delivered.

Urgency

A measure of how quickly an incident needs to be resolved.
