Business Continuity and
Disaster Recovery
Business continuity planning (BCP) and contingency planning in
support of operations are elements of a system of internal control
that is established to manage availability of critical processes in the
event of interruption. The most important part of such a plan deals
with the cost-effective support of the information system.
Availability of business data is vital to the sustainable development
and/or even to the survival of any organization.
Overview of Business Continuity &
Disaster Recovery
BCP and disaster recovery planning (DRP)
processes
Business impact analysis (BIA)
Recovery strategies and alternatives
Plan testing
Backup and restoration
Audit considerations
Examples of Corporate Risks
Inability to maintain critical customer services
Damage to market share, image, reputation
or brand
Failure to protect the company assets,
including intellectual properties and
personnel
Business control failure
Failure to meet legal or regulatory
requirements
The Purpose of Business
Continuity/Disaster Recovery
The purpose of business continuity/disaster
recovery is to enable a business to continue
offering critical services in the event of a
disruption and to survive a disastrous
interruption to its activities. Rigorous
planning and commitment of resources are
necessary to prepare adequately for such an
event.
Disasters and other Disruptive Events
Earthquakes
Floods
Tornados
Thunderstorms
Fire
Discontinuation of Services like Electrical Power,
Telecommunications, Natural Gas
Terrorist Attacks
Hacker Attacks
Virus Attacks
System Malfunctions
Accidental File Deletions
Network Denial-of-Service (DoS) Attacks
Intrusion
Business Continuity Planning Process
The BCP process can be divided into the following life cycle phases:
Creation of a business continuity policy
Business Impact Analysis (BIA)
Classification of operations and criticality analysis
Identification of IS processes that support critical organizational
functions
Development of a BCP and IS disaster recovery procedures
Development of resumption procedures
Training and awareness program
Testing and implementation of plan
Monitoring
Recovery Point Objective and Recovery Time Objective
Recovery point objective (RPO): The earliest point in time to which it is
acceptable to recover data, determined by the acceptable data loss.
Recovery time objective (RTO): The time within which business operations
must resume after a disruption, determined by the acceptable downtime.
Interruption window: The time the organization can wait from the point
of failure to the restoration of critical services/applications. After this
time, the progressive losses caused by the interruption are unaffordable.
Service delivery objective (SDO): Level of services to be reached during
the alternate process mode until the normal situation is restored. This is
directly related to the business needs.
Maximum tolerable outage (MTO): The maximum time the organization can
support processing in alternate mode. After this point, different problems
may arise, especially if the alternate SDO is lower than the usual SDO, and
the information pending update can become unmanageable.
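The three objectives above can be read as thresholds against which a disruption scenario is checked. The following is a minimal sketch under stated assumptions: the class name, function name, field names and the representation of the SDO as a fraction of normal service are all hypothetical choices for illustration, not part of any standard.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ContinuityObjectives:
    """Hypothetical container for the objectives defined above."""
    # Longest acceptable wait from failure to restoration of critical services.
    interruption_window: timedelta
    # Service level required in alternate mode, as a fraction of normal (assumption).
    service_delivery_objective: float
    # Longest time processing can be supported in alternate mode.
    maximum_tolerable_outage: timedelta


def evaluate_outage(objectives, time_to_restore, alternate_service_level,
                    time_in_alternate_mode):
    """Return the names of the objectives that a scenario would violate."""
    violations = []
    if time_to_restore > objectives.interruption_window:
        violations.append("interruption window")
    if alternate_service_level < objectives.service_delivery_objective:
        violations.append("service delivery objective")
    if time_in_alternate_mode > objectives.maximum_tolerable_outage:
        violations.append("maximum tolerable outage")
    return violations


# Illustrative figures only.
objectives = ContinuityObjectives(
    interruption_window=timedelta(hours=4),
    service_delivery_objective=0.6,
    maximum_tolerable_outage=timedelta(days=10),
)
print(evaluate_outage(objectives, timedelta(hours=6), 0.7, timedelta(days=3)))
```

An empty list means the scenario stays within all three objectives. In practice the threshold values would come from the business impact analysis.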
Hot Sites
They are fully configured and ready to operate within several
hours. The equipment, network and systems software must
be compatible with the primary installation being backed up.
The only additional needs are staff, programs, data files and
documentation.
Costs associated with the use of a third-party hot site are
usually high, but less than creating a redundant site, and are
often cost justifiable for critical applications.
Warm Sites
They are partially configured, usually with network
connections and selected peripheral equipment, such as disk
drives and other controllers, but without the main computer.
Sometimes a warm site is equipped with a less-powerful
central processing unit (CPU) than the one generally used.
The assumption behind the warm site concept is that the
computer can usually be obtained quickly for emergency
installation (provided it is a widely used model) and, since the
computer is the most expensive unit, such an arrangement is
less costly than a hot site. After the installation of the needed
components, the site can be ready for service within hours;
however, the location and installation of the CPU and other
missing units could take several days or weeks.
Cold Sites
They have only the basic environment (e.g., electrical wiring,
air conditioning and flooring) to reduce the cost. The cold
site is ready to receive equipment, but does not offer any
components at the site in advance of the need. Activation of
the site may take several weeks.
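The trade-off between hot, warm and cold sites is essentially activation time against cost. One way to picture the choice is to pick the least expensive site type that can still be activated within the interruption window. The sketch below assumes placeholder activation times and a simple cost ranking; the text gives only orders of magnitude (hours, days or weeks, several weeks), so every figure and name here is illustrative.

```python
from datetime import timedelta

# Placeholder figures chosen for illustration; real values depend on the
# contract, the equipment involved and the organization.
SITE_OPTIONS = [
    # (site type, assumed activation time, relative cost rank: higher = costlier)
    ("cold", timedelta(weeks=3), 1),
    ("warm", timedelta(days=7), 2),
    ("hot", timedelta(hours=8), 3),
]


def cheapest_adequate_site(interruption_window):
    """Return the lowest-cost site type that activates within the window.

    Returns None when no listed option is fast enough, which would point to
    a faster arrangement such as a redundant site.
    """
    candidates = [
        (cost, name)
        for name, activation_time, cost in SITE_OPTIONS
        if activation_time <= interruption_window
    ]
    if not candidates:
        return None
    return min(candidates)[1]


for window in (timedelta(days=1), timedelta(days=10), timedelta(weeks=4)):
    print(window, "->", cheapest_adequate_site(window))
```

A real selection would also weigh compatibility with the primary installation and the availability of staff, programs and data at the site, which this sketch leaves out.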
Mobile Sites
This is a specially designed trailer that can be quickly transported to
a business location or to an alternate site to provide a ready-
conditioned facility. These mobile sites can be connected to form
larger work areas and can be preconfigured with servers, desktop
computers, communications equipment, and even microwave and
satellite data links. They are a useful alternative when there are no
recovery facilities in the immediate geographic area. They are also
useful in case of a widespread disaster and are a cost-effective
alternative to duplicating facilities for a multi-office organization.
Organization and Assignment of
Responsibilities
Incident response team: A team that has been designated to receive the information about
every incident that can be considered as a threat to assets/processes. This reporting can be
useful for coordinating an incident in progress and/or for postmortem analysis. The analysis of
all incidents also provides input for updating the recovery plans.
Emergency action team: They are first responders, designated fire wardens and bucket
crews, whose function is to deal with fires or other emergency response scenarios. One of
their primary functions is the orderly evacuation of personnel and the securing of human life.
Information security team: The main mission of this team is to develop the needed steps to
maintain a similar level of information and IT resource security as was in place at the
primary site before the contingency, and implement the needed security measures in the
alternative procedures environment. Additionally, this team must continually monitor the
security of system and communication links, resolve any security conflicts that impede the
expeditious recovery of the system, and assure the proper installation and functioning of
security software. The team is also responsible for the security of the organization's assets
during the disorder following a disaster.
Damage assessment team: Assesses the extent of damage following the disaster. The team
should consist of individuals who have the ability to assess damage and estimate the
time required to recover operations at the affected site. This team should include staff skilled
in the use of testing equipment, knowledgeable about systems and networks, and trained in
applicable safety regulations and procedures. In addition, they have the responsibility to
identify possible causes of the disaster and their impact on damage and predictable downtime.
Emergency management team: Responsible for coordinating the activities of all other
recovery/continuity/response teams and handling key decision making. They determine the
activation of the BCP. Other functions entail arranging the finances of the recovery, handling
legal matters arising from the disaster, and handling public relations and media inquiries.
Offsite storage team: Responsible for obtaining, packaging and shipping media and records
to the recovery facilities, as well as establishing and overseeing an offsite storage schedule for
information created during operations at the recovery site.
Software team: Responsible for restoring system packs, loading and testing operating
systems software, and resolving system-level problems.
Applications team: Travels to the system recovery site and restores user packs and
application programs on the backup system. As the recovery progresses, this team may have
the responsibility of monitoring application performance and database integrity.
Test Execution
Pretest: The set of actions necessary to set the stage for the actual test. This
ranges from placing tables in the proper operations recovery area to
transporting and installing backup telephone equipment. These activities are
outside the realm of those that would take place in the case of a real
emergency, in which there is no forewarning of the event and, therefore, no
time to take preparatory actions.
Test: This is the real action of the business continuity test. Actual operational
activities are executed to test the specific objectives of the BCP. Data entry,
telephone calls, information systems processing, handling orders, and
movement of personnel, equipment and suppliers should take place.
Evaluators review staff members as they perform the designated tasks. This is
the actual test of preparedness to respond to an emergency.
Post-test: The cleanup of group activities. This phase comprises such
assignments as returning all resources to their proper place, disconnecting
equipment, returning personnel, and deleting all company data from third-
party systems. The post-test cleanup also includes formally evaluating the
plan and implementing indicated improvements.
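The three phases above run in a fixed order, and the post-test cleanup matters even when the test itself goes badly (for example, company data must still be deleted from third-party systems). A minimal sketch of that sequencing follows. The function name and the callable-per-phase interface are hypothetical, and running cleanup after a failed test is a design choice of this sketch rather than something the text prescribes.

```python
def run_bcp_test(pretest, test, post_test):
    """Run the three phases in order and return the names of completed phases.

    Each argument is a zero-argument callable supplied by the caller.
    """
    completed = []
    pretest()
    completed.append("pretest")
    try:
        test()
        completed.append("test")
    finally:
        # Cleanup runs even if the test phase raises; any exception from the
        # test phase still propagates to the caller afterwards.
        post_test()
        completed.append("post-test")
    return completed


# Usage with trivial stand-in phases.
actions = []
phases = run_bcp_test(
    lambda: actions.append("install backup telephone equipment"),
    lambda: actions.append("process orders at recovery site"),
    lambda: actions.append("delete company data from third-party systems"),
)
print(phases)
print(actions)
```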
Types of Tests
Desk-based evaluation/paper test: A paper walk-through of the
plan, involving major players in the plan's execution who reason out
what might happen in a particular type of service disruption. They
may walk through the entire plan or just a portion. The paper test
usually precedes the preparedness test.
Preparedness test: Usually a localized version of a full test, wherein
actual resources are expended in the simulation of a system crash.
This test is performed regularly on different aspects of the plan and
can be a cost-effective way to gradually obtain evidence about how
good the plan is. It also provides a means to improve the plan in
increments.
Full operational test: This is one step away from an actual service
disruption. The organization should have tested the plan well on
paper and locally before endeavoring to completely shut down
operations. For purposes of the BCP testing, this is the disaster.
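The three test types form a progression: the paper test usually comes first, and a full operational test should follow only after the plan has been tested on paper and locally. This ordering can be sketched as a simple gate. The enum and function names are hypothetical, and treating the order as a strict prerequisite is an assumption of this sketch, since the text says only that the paper test "usually" precedes the preparedness test.

```python
from enum import IntEnum


class BcpTestType(IntEnum):
    """Test types in increasing order of cost and disruption."""
    PAPER = 1
    PREPAREDNESS = 2
    FULL_OPERATIONAL = 3


def may_run(test_type, passed):
    """Return True if every less disruptive test type is in `passed`."""
    return all(prior in passed for prior in BcpTestType if prior < test_type)


passed_so_far = {BcpTestType.PAPER}
print(may_run(BcpTestType.PREPAREDNESS, passed_so_far))
print(may_run(BcpTestType.FULL_OPERATIONAL, passed_so_far))
```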