Network Troubleshooting
By Othmar Kyas
An Agilent Technologies Publication
Section I
Basic Concepts
Chapter 1
Network Availability
1.1 The Strategic Importance of Information Technology
1.2 Intranets and the Internet: Revolutions in Network Technology
1.3 The Behavior of Complex Network Systems: Catastrophe Theory
1.4 The Causes of Network Failure
    1.4.1 Operator Error
    1.4.2 Mass Storage Problems
    1.4.3 Computer Hardware Problems
    1.4.4 Software Problems
    1.4.5 Network Problems
1.5 Calculation and Estimation of Costs Incurred Due to Network Failures
    1.5.1 Immediate Costs
    1.5.2 Consequential Costs
For additional excerpts from this chapter and other Network Troubleshooting book sections, be sure to regularly visit our web site at:
www.FreeTroubleshootingBook.com
New chapters will be posted every 2 to 3 weeks. Be sure to visit our web site and vote for the chapters you would like to see posted!
Network Availability
Waiting for an alarm is not the ideal form of network management. BOB BUCHANAN, THE NETWORK JOURNAL
1.1
The Strategic Importance of Information Technology
Growing financial and competitive pressures in the business world mean that companies everywhere must continuously optimize their internal and external structures in order to survive. All business processes and routines must be reviewed regularly for effectiveness (Are we doing the right things?) and for efficiency (Are we doing things right?). Most business processes today consist of physical activities, such as the manufacture of a metal part, combined with information flow: How many parts should be produced? When? In what sizes? Increasingly, key business processes (in insurance companies, travel agencies, banks, and airlines, for example) consist entirely of information flow. Today, of course, the flow of information is largely dependent on information technology, that is, on computers, databases and networks. A high-performance, high-availability information technology (IT) system is becoming a prerequisite for successful execution of the business practices that are decisive in maintaining a leadership position in today's competitive markets.
The role of the computer in business has changed radically over the past few years, from a tolerated plaything to a cornerstone of corporate infrastructures. This change has taken place so rapidly that in many companies IT still has not taken a central position in managerial circles, even though it has long since become indispensable for day-to-day business functions.
The reliance of enterprises on smoothly functioning IT infrastructures will continue to grow in the coming years. Areas of business that until recently had little to do with computer technology, such as marketing and customer service, are increasingly IT-based. This is largely due to the advent of customer interfaces that allow consumers to perform many transactions electronically, such as placing orders or making reservations. In fact, the proportion of people who work directly or indirectly with IT has grown in recent years to more than 50 percent (see Figure 1.1).
Figure 1.1 Changes in employment patterns in Western industrial countries since 1800 (percentage of the workforce in agriculture, production, services, and information work, 1800-2000, across the agrarian, industrial, and information ages)
Figure 1.2 Downtime costs per hour by industry (Source: AT&T/Gartner Group)

Industry             Downtime costs per hour (US$)
Financial services   6,000,000
Financial services   2,400,000
Media                150,000
Retail               100,000
Retail               80,000
Travel/tourism       82,500
Shipping             25,500
In keeping with these developments, the professional operation and management of computer networks has long since ceased to be a necessary evil. On the contrary, it has become a decisive strategic necessity for the success of almost any enterprise. A network failure that lasts only a few hours can cost millions of dollars. According to a study carried out by AT&T, companies that deal in financial services, such as investment brokerages or credit card firms, can suffer losses of 2.5 to 5 million dollars from just 1 hour of network downtime (see Figure 1.2).
1.2
Intranets and the Internet: Revolutions in Network Technology
The difficulties involved in the professional operation of high-performance data networks have been further complicated by the Internet revolution, which has brought about radical changes in network technology and applications. Since the mid-1990s the Internet has not only developed into a universal communications medium, but has also become a global marketplace for the exchange of goods and services. As a result, growing numbers of businesses are faced with the necessity of providing their employees with Internet access. Special network infrastructures are now required in order to provide electronic access for increasing numbers of Internet-based consumers. Once it was sufficient to have just a few carefully controlled wide-area network (WAN) links in an otherwise homogeneous local-area network (LAN). Today, however, a secure, high-performance LAN-WAN structure is indispensable.
Internet technologies are also being introduced into company networks, leading to the development of corporate intranets. This has necessitated further restructuring so that broad areas of internal data processing can be adapted to the transport mechanisms, protocols and formats used in the Internet. All of these developments have caused the World Wide Web (WWW) to take on a position of global importance as a uniform user interface.
At the same time, these changes have placed enormous demands on network managers. In many cases, the skills and tools available for managing computer systems and networks can barely keep up with the increasing complexity of data network structures. And to add to the difficulty of the task, the technology cycles in data communications (the intervals at which new and more powerful data communication technologies are introduced) are getting shorter all the time. Whereas the classic 10 Mbit/s Ethernet topologies shaped computer networking throughout the 1980s, the 1990s have seen the introduction of new technologies almost every year, including LAN switching, 100 Mbit/s Ethernet, Gigabit Ethernet, ATM (Asynchronous
Transfer Mode), IP (Internet Protocol) switching, Packet over SONET (PoS) and ADSL (Asymmetric Digital Subscriber Line), to name just a few. Product life cycles in the IT field are often measured in months now rather than years. This rapid pace of technological development puts manufacturers and users alike under tremendous pressure to keep abreast of constant innovation (see Figure 1.3).
Figure 1.3 Adopted standards and standards in development for high-speed network technologies, 1982-2000: FDDI, CDDI, Fibre Channel (133 Mbit/s, 266 Mbit/s, 530 Mbit/s, 1 Gbit/s), FFOL, Gigabit Ethernet (1000 Mbit/s) and ATM, at data rates from 100 Mbit/s to 1500 Mbit/s
1.3
The Behavior of Complex Network Systems: Catastrophe Theory
The enormous technological complexity in combination with the large numbers of hardware and software components used in networks makes operation and management a difficult task, to say the least. Communication media, connectors, hubs, switches, repeaters, network interface cards, operating systems, data protocols, driver software, and application software must all function smoothly under widely varying conditions, including network load, number of nodes connected, and size of data packets transmitted. Even when a given system has attained a relatively stable operating state, its stability is constantly put to the test by dynamic variations as well as by operator errors, administrative errors, configuration changes, and hardware and software problems. In general, the more complex a system is and the greater the number of parameters that influence it, the more difficult it is to predict its behavior.
Catastrophe theory (see René Thom, 1975) offers an excellent model for describing the behavior of systems as complex as computer networks. This theory can provide at least qualitative descriptions of system behavior, especially for non-linear operating states, such as those that often accompany a network breakdown. Catastrophe theory postulates seven elementary catastrophes, which behave in a given manner according to the number rather than the type of control parameters influencing the system. The behavior model for catastrophes determined by two parameters, for example, is called a cusp graph. The cusp is a three-dimensional surface whose upper side represents balanced states, while the lower surface represents unstable maxima.
Catastrophe theory can be applied to Ethernet networks, for example, to show the effects of two control parameters, slot time and network load, on throughput. Slot time, which is defined as twice the time it takes a signal to travel between the two nodes that are farthest apart in an Ethernet segment, is influenced by the network components that cause signal transmission delay or latency, such as cables, repeaters or hubs. Figure 1.4 shows the behavior pattern for throughput when all other variables, such as network load, average packet size and number of network nodes, are constant. An increase in traffic in a network with a given slot time a moves the operating state across the upper surface of the cusp. The rise along the x-axis indicates increasing throughput. Starting from the higher slot time b, however, the same increase in traffic drastically reduces network efficiency. All processes take place on the surface of the cusp and are thus linear. If the operating state is a when the network load increases, and subsequently the slot time increases from state c (Figure 1.5), an abrupt departure from the balanced state takes place at point d in order to arrive directly at point e. Point e represents a stable operating state, but one in which throughput is minimal. The abrupt transition from d to e constitutes a catastrophe.
The model provided by catastrophe theory clearly illustrates how complex and unpredictable a network can be. The symptoms that indicate problems in a network are often caused by a series of errors. One event triggers another, and the resulting state yet another, and so on. Feedback may either amplify or reduce the effects of error events. When the error symptom is finally detected, it may be far removed from its original locus, in a completely different form, and appear to have been triggered by some trivial event.
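For reference, the cusp described above has a standard mathematical form in catastrophe theory. The equations below are the textbook canonical form (they do not appear in this book); x stands for the behavior variable (throughput in the Ethernet example), and a and b for the two control parameters (slot time and network load):

% Canonical cusp catastrophe potential (standard form, given here for illustration)
V(x; a, b) = \tfrac{1}{4}x^{4} + \tfrac{1}{2}a x^{2} + b x

% Equilibrium surface (the folded sheet shown in Figures 1.4 and 1.5): dV/dx = 0
x^{3} + a x + b = 0

% Fold lines, where stable and unstable equilibria merge (d^2V/dx^2 = 0 as well)
3x^{2} + a = 0 \quad\Longrightarrow\quad 4a^{3} + 27b^{2} = 0

Crossing a fold line corresponds to the abrupt jump from point d to point e described above.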
Figure 1.4 The cusp surface for throughput in an Ethernet network as a function of slot time and network load (showing the fold line, unstable zone, bimodality, and divergence)
Figure 1.5 A non-linear operating state (network failure) in an Ethernet network (the cusp surface with slot time, network load, and throughput axes; operating points a, b, c, d and e; the jump from d to e is the catastrophe, and the fold region shows hysteresis)
1.4
The Causes of Network Failure
There are five categories of errors that can lead to system failure:
- Operator error
- Mass storage problems
- Computer hardware problems
- Software problems
- Network problems
1.4.1
Operator Error
On the average, operator error is responsible for over 5 percent of all system failures, a large enough proportion to merit a closer look. Operator errors can be classified as intentional or unintentional mistakes, and as errors that do or do not cause consequential damage. The term intentional error does not necessarily indicate that the error itself was the operator's intent, but rather that it resulted from some intentional action, such as trying to take a shortcut. The belief that a given process can be shortened, or that certain quality control or safety guidelines are superfluous, can lead to error situations with or without consequential damage. Less common are the truly intentional errors motivated, for example, by an employee's desire for revenge against a superior or the company, by the desire to cause trouble for a colleague (by making mistakes that the colleague will be blamed for), or by destructiveness brought on by general frustration.
Unintentional errors usually result either from insufficient understanding of a given process or from poor concentration. Other common causes include software and hardware errors (the system does not behave as it should even though it is configured and operating correctly) and installation and configuration errors (errors occur even though the system is operating correctly and the software or hardware is functioning according to specification). Sometimes a series of minor errors, which individually go undetected because no harmful effects are noted, are eventually compounded so that serious errors or even system failures result.
1.4.2
Mass Storage Problems
Problems with hard disks are the most common cause of failures in data processing. More than 26 percent of all system failures can be traced to faults in mass storage media. Although high-performance mass storage can attain a mean time between failures (MTBF) of over 10^6 hours, this could still mean replacing hard disks almost every month if the system has a large number of disk drives. There is usually a wide gap between the theoretical MTBF and the operational MTBF that can be achieved in practice. The probability that a hard disk drive with a theoretical MTBF of 10^6 hours (about 114 years) will actually run that long without error is only 30 percent.
To calculate the number of hard disks that will have to be replaced within a certain period of time in a given system, multiply the total number of hard disk drives in the system by the period of system service in hours, and then divide this number by the theoretical MTBF. For example, in a system that has 1,000 disk drives, each of which has a theoretical MTBF of 10^6 hours, the number of failures A in the first 5 years (43,800 hours) comes to 44 (see the following equation).
A = \frac{1{,}000 \text{ disks} \times 43{,}800 \text{ hours}}{1{,}000{,}000 \text{ hours per disk}} \approx 44
This is based on the assumption that all of the hard disk systems have the same MTBF and are operated under similar conditions. Tests have shown that mass storage units operating in warm ambient conditions tend to show a lower actual MTBF than those operated in well-cooled environments. Furthermore, frequent disk search operations and changes in location have both been shown to have negative effects on the service life of mass storage media. For this reason, some hard disk manufacturers use another value in addition to the theoretical and operational MTBF to indicate the probable period of error-free operation for their products. This value, called the cumulative distribution function (CDF), indicates the probability that a mass storage medium will fail within a specified time. For example, a CDF of 4 percent over 5 years means that there is a 4 percent chance that the medium in question will break down within the first 5 years of use.
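As a quick illustration (not from the book), the replacement estimate above can be reproduced in a few lines of Python. The figures are the example values given in the text, and the constant failure rate of 1/MTBF per disk is the simplifying assumption behind the formula:

def expected_disk_failures(num_disks, service_hours, mtbf_hours):
    # Expected number of disk replacements over the service period,
    # assuming a constant failure rate of 1/MTBF per disk.
    return num_disks * service_hours / mtbf_hours

# 1,000 disks, 5 years of continuous service, theoretical MTBF of 10**6 hours
service_hours = 5 * 8_760                                       # 43,800 hours
print(expected_disk_failures(1_000, service_hours, 1_000_000))  # ~43.8, i.e. about 44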
Figure 1.6 Causes of system failure (Source: Find/SVP): Users 5.9%, Data networks 21.1%, Software 22.3%, Computer hardware 24.3%, Mass storage 26.4%
1.4.3
Computer Hardware Problems
Roughly one-quarter of all system failures are caused by computer hardware problems. By definition, this includes problems with any computer hardware component, including monitor, keyboard, mouse, CPU, RAM, hard disks and floppy disk drives. The average error-free service life of a system is calculated from the sum of the MTBF values of its components divided by the number of components. The following are some average MTBF values for various computer system components:
- RAM chips: 8,000,000 hours
- Floppy disk drives, mice, CD-ROM drives: 2,000,000 hours
- 10Base-T interface cards: 5,000,000 hours
- FDDI, ATM interface cards: 400,000 hours
- CPUs: 100,000 hours
The MTBF values calculated for today's computer systems average between 10,000 and 50,000 hours. In general, the more complex a system is, the lower the average MTBF. A system with multiple processors and multiple network links, for example, is more error-prone than a comparatively simple server with only one processor.
The Annual Failure Rate (AFR) is a better indicator of reliability than the MTBF. The AFR is the number of hours per year that the system is in operation divided by the MTBF. When a server system with an MTBF of 25,000 hours is in constant operation, the AFR amounts to 8,760/25,000 = 0.35, or roughly one failure every three years.
Another important parameter for the availability of computer systems is the mean time to repair (MTTR), which indicates the average length of time it takes to repair the system after a failure. The MTTR is the total repair time divided by the number of system failures. Typical MTTR values lie between 2 and 3 hours when the repair time used in the calculation is the amount of work time actually spent repairing the system.
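The following is a brief sketch (not from the book) of how the AFR and the expected annual downtime follow from the MTBF and MTTR; the 25,000-hour MTBF and the 2-3 hour MTTR are the example values from the text:

def annual_failure_rate(mtbf_hours, operating_hours_per_year=8_760):
    # Expected number of failures per year for the given hours of operation.
    return operating_hours_per_year / mtbf_hours

def expected_annual_downtime(mtbf_hours, mttr_hours, operating_hours_per_year=8_760):
    # Expected downtime per year: failures per year times mean time to repair.
    return annual_failure_rate(mtbf_hours, operating_hours_per_year) * mttr_hours

print(annual_failure_rate(25_000))            # ~0.35 failures per year
print(expected_annual_downtime(25_000, 2.5))  # ~0.9 hours of downtime per year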
1.4.4
Software Problems
Software problems cause almost as many failures as hardware problems do. The widespread use of client-server architectures and distributed platforms in enterprise networks has led to such complex combinations of software that it is almost impossible to monitor system behavior under all network loads and in all operating states. In the age of corporate intranets and the Internet, the update schedules for software applications are becoming shorter all the time, so that sufficient time is not allowed for detailed testing before software is released. Automatic testing tools, such as LoadRunner (from Mercury Interactive, www.loadrunner.com) or AutoTester (from AutoTester Inc., www.autotester.com), which attempt to simulate various extreme operating situations, provide only limited assistance. Problems with new software that can lead to system failure arise not only at the application level, but also as a result of unstable software drivers, faulty installation or backup procedures, or operating system errors.
1.4.5
Network Problems
The fifth major category of IT problems encompasses errors that occur within the network itself. When the software and hardware problems that are directly related to network operation are included in this category (such as problems with network interface cards or with certain components of application software, protocols and card drivers), this group accounts for more than one-third of all IT failures.
These network errors can be classified by OSI layer. As shown in Figure 1.7, 30 percent of all LAN errors occur on OSI layers 1 and 2. Typical causes are defective cables, connectors, or interface cards; defective modules in hubs, bridges or routers; collisions (in Ethernet networks); beacon processes (in Token-Ring networks); checksum errors; and incorrect packet sizes. The development and implementation of more reliable hardware components, coupled with the continuous improvement of cabling systems, have meant a decrease in the absolute numbers of these types of errors, but developments in software have had similar effects on the higher OSI layers as well. More stable network operating systems and applications as well as mature protocol stacks have also reduced the number of failures per network segment. As a result, the distribution of error sources over the seven OSI layers has remained roughly the same over the past several years.
In wide-area networks (WANs), the proportion of errors occurring on the physical layer is even higher. Where permanent WAN links (leased lines) are employed, 80 percent of all errors (in the case of ISDN, Integrated Services Digital Network, as many as 90 percent of all errors) can be traced to component failure, defective modems, or cable and connector faults (see Figure 1.7).

Figure 1.7 Distribution of data network problems in local- and wide-area networks. Frequency distribution of LAN errors by OSI model layer: Physical Layer 20%, Data Link Layer 10%, Network Layer 25%, Transport Layer 15%, Session Layer 5%, Presentation Layer 5%, Application Layer 20%. Wide-area networks (leased lines): Physical Layer 80%, higher protocol layers 20%. Wide-area networks (ISDN): Physical Layer 90%, higher protocol layers 10%.
1.5
Calculation and Estimation of Costs Incurred Due to Network Failures
It is becoming increasingly important to estimate the costs that are incurred in the event of network failure. These can be difficult to quantify, however. Nonetheless, a fairly clear idea of the financial impact of system failure is essential in order to determine the optimum infrastructure dimensions from the perspective of network management and maintenance. Knowing the costs of system failure enables the enterprise to make informed decisions regarding the level of investment in redundant components or network management and troubleshooting systems.
All too often the costs of system failure are grossly underestimated. It may be true that the exorbitant losses of $100,000 per minute and more reported in some superficial studies apply only to a few special cases, such as when system failure affects production control systems, financial services offered by credit card companies, or investment brokerages. Nonetheless, the consequences of a network breakdown even in small or medium-sized companies should not be underrated.
The average availability of a data network today is between 98 and 99 percent. A system that is in operation 10 hours a day, 5 days a week can expect network downtime totaling between 52 and 104 hours per year. If an average of 100 employees are affected by a network failure, this means a maximum loss of productivity of between 5,200 and 10,400 hours. This type of oversimplified calculation, however, quickly leads to inflated figures that do not necessarily reflect real situations.
The first step toward a more realistic analysis of network downtime costs is to distinguish between immediate costs incurred within the first 24 hours following the failure and consequential costs that arise after the first 24 hours. Costs in each of these categories are further divided into direct and indirect costs. Direct costs include all expenditures that are directly involved in correcting the network problem, while indirect costs include such factors as lost employee productivity and delayed project completion.
1.5.1
Immediate Costs

Direct Costs
- Replacement parts (network cards, cable, repeaters, hubs, etc.)
- New components (bridges, routers, servers, etc.)
- Rental or purchase of diagnostic equipment (network analyzers, cable testers, etc.)
- Consulting fees charged by network specialists
- Consulting fees charged by software/hardware manufacturers
- Overtime compensation for network support staff

Indirect Costs
- Loss of employee productivity at computer workstations
- Loss of productivity on production lines, in shipping and receiving departments, or in warehouse management; downtime of automated warehousing systems, etc.
- Loss of consumer or customer orders and confidence

Easiest to calculate are the direct immediate costs, such as the purchase of replacement components or consultants' fees, because these are automatically documented by invoices. In mid-sized networks (around 500 nodes) with an availability of 99 percent, the average downtime of 52 hours results from an average of 10 to 20 failures of the network or parts of it, lasting between 1 and 5 hours each. If the direct immediate costs of solving the problem average $1,250 per case, then the direct cost of restoring operation after 10 failures comes to $12,500.
Quantifying the indirect immediate costs is more difficult. In general, only the cumulative loss of employee productivity is calculated. The extent of this loss, however, depends mainly on the degree to which employee productivity is dependent on network availability. Often a number of employee activities can be postponed until the next day, or at least for a few hours, without significant loss
of productivity. In mid-sized office environments, therefore, loss of employee productivity is usually estimated at roughly 25 percent of the total network downtime. For example, if an average of 100 employees are affected by the network breakdown, at 52 hours of downtime per year the loss of productivity amounts to 52 × 0.25 × 100 = 1,300 hours. If the average gross salary costs come to $40 per hour, this puts the immediate indirect costs at $52,000.
1.5.2
Consequential Costs

Direct Costs
- New or adjusted hardware configuration in the network (restructuring of servers, bridges, etc.)
- Testing of other network segments for errors similar to those that caused the failure
- Documentation of the system failure

Indirect Costs
- Delayed project completion (product development, production, etc.)
- Delayed services (tenders, invoices, entering transactions in accounts, etc.)
- Loss of customer loyalty and satisfaction

Consequential indirect costs resulting from network failure are the most difficult to calculate. These costs are also referred to as company losses because they cannot be attributed to any one department or cost center. The amount of such costs is proportional to the degree to which the company depends on network-supported processes. Tenders may have to be printed and sent a day later than planned, for example. Incoming orders and payments may be similarly delayed. Incoming deliveries may be blocked if receiving slips cannot be printed or automated warehousing equipment cannot be operated. Urgent shipments sent by special courier result in higher shipping rates. Late charges may be incurred for bills that cannot be processed. Sales may be lost due to unavailable Web-based ordering systems. Customers may grow dissatisfied if they cannot reach a support hotline, which means a loss of future orders. These are just a few examples of company losses as consequential costs of network failures.
At a fairly low estimate of $1,000 in consequential indirect costs and $250 in consequential direct costs per failure, the total annual loss due to network failures, based on the conditions described previously, comes to $77,000. This means each hour of downtime costs the company $1,480. Or, to look at the case from another perspective, an improvement of a mere 0.1 percent in network availability saves the company $7,700.
Figure 1.8 Sample calculation of annual network failure costs

Network availability                                              99%
Annual downtime (hours)                                           52
Number of employees affected per failure                          100
Dependency of employee productivity on network availability       25%
Average annual failures                                           10
Average direct, immediate costs per year/per failure
  (replacement parts, etc.)                                       $12,000 ($1,200 per failure)
Average indirect, immediate costs
  (loss of employee productivity)                                 100 × 0.25 × 52 = 1,300 h × $40 = $52,000
Average direct, consequential costs per year/per failure
  (planning, failure documentation, reconfiguration)              $2,500 ($250)
Average indirect, consequential costs per year/per failure
  (company losses)                                                $10,000 ($1,000)
Annual network failure costs                                      $77,000
Hourly network failure costs                                      $1,481
Network failure costs per 0.1% of downtime                        $7,700
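For readers who want to adapt this estimate to their own figures, the calculation in Figure 1.8 can be written as a short Python sketch (an illustration, not a formula from the book; the parameter values are the example figures above):

def annual_network_failure_costs(downtime_hours, employees_affected,
                                 productivity_dependency, hourly_salary_cost,
                                 failures_per_year, direct_immediate,
                                 direct_consequential, indirect_consequential):
    # Indirect immediate costs: lost employee productivity during downtime.
    indirect_immediate = (downtime_hours * productivity_dependency
                          * employees_affected * hourly_salary_cost)
    # Per-failure costs: direct immediate plus direct and indirect consequential.
    per_failure = direct_immediate + direct_consequential + indirect_consequential
    total = failures_per_year * per_failure + indirect_immediate
    return total, total / downtime_hours

total, per_hour = annual_network_failure_costs(52, 100, 0.25, 40, 10, 1_200, 250, 1_000)
print(total, per_hour)  # 76,500 and ~1,471; the text rounds these to $77,000 and $1,481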
1.6
High Availability
High-availability data processing infrastructures have become a basic requirement for smooth business processes in commercial data processing. Barring special measures taken to maximize network availability, the average availability of today's IT systems is between 98 and 99 percent, which corresponds to a total annual downtime of 50 to 100 hours. For a growing number of companies, however, even this is too much downtime. Special systems can be added to boost network availability to between 99.9 and 99.9999 percent (99.999 percent uptime is equivalent to about 5.3 minutes of downtime in one year). In this way the average downtime per year can be reduced to a few hours or even, in the extreme case, a few minutes.
The costs of availability, however, increase almost exponentially with each additional decimal place. Before planning a high-availability system, it is important to specify exactly what service levels are required. This determines the degree of availability that must be guaranteed. Availability is expressed as a percentage, calculated from the total operating time and the downtime:

\text{Availability} = \frac{\text{total operating time} - \text{downtime}}{\text{total operating time}}

Figure 1.9 Availability levels with typical failure durations and annual downtime

Availability    Typical failure duration    Annual downtime
100%            None                        None
99.9999%        Ticks                       0.5 minutes
99.999%         Ticks to seconds            Up to 5 minutes
99.99%          Seconds to minutes          Up to 50 minutes
99.9%           Minutes                     Up to 8 hours
99%             Hours                       Several days
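The annual downtime column in Figure 1.9 follows directly from the availability percentage. A few lines of Python (an illustration, not from the book, assuming round-the-clock operation of 8,760 hours per year) reproduce it:

def annual_downtime_minutes(availability_percent):
    # Downtime per year implied by an availability level, for 24x7 operation.
    return 8_760 * 60 * (1.0 - availability_percent / 100.0)

for a in (99.0, 99.9, 99.99, 99.999, 99.9999):
    print(a, round(annual_downtime_minutes(a), 1))
# 99% -> 5,256 min (several days), 99.9% -> ~526 min (~8.8 h),
# 99.99% -> ~53 min, 99.999% -> ~5.3 min, 99.9999% -> ~0.5 min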
Another important factor is the average downtime resulting from system failure, which is called the mean time to repair or MTTR. In most cases, a large number of short service interruptions, lasting only seconds or minutes, is acceptable, while just a few failures that last for several hours each have serious consequences.
The main prerequisite for a high-availability infrastructure is the use of high-quality components. Even without any special equipment or configuration for ultra-high availability, the quality of components is an important factor in the reliability of hardware and software. Component quality is also decisive for the performance of diagnostic tools and system and network management applications, as well as for the level of maintenance and support that can be attained. If no concessions are made in these areas, the availability of the data processing structure is bound to be significantly above average. Availability can only be improved beyond this level by the addition of components and services. These can include:
- Redundant components
- Software and hardware switching
- Detailed planning of every scheduled downtime
- Reduction of system administration tasks
- Development of automatic error reaction systems
- Thorough acceptance testing prior to installation of new hardware or software components
- Specifications and practice drills for operator response to system failure
- Replicated databases and application software
- Clustering

Redundant components can reduce the number of single points of failure in the network. When a given network component fails, its redundant counterpart is activated automatically. If the installation of fallback components is combined with software and hardware switching technologies, the redundant components can take over for malfunctioning components within seconds or even fractions of seconds. Reducing the level of interaction between the network and administrator is a useful step in establishing deterministic reactions to different error scenarios: ideally, a given error should consistently trigger a single, defined process. The individual steps involved in introducing a high-availability system are shown in Figure 1.10.

Figure 1.10 Steps in introducing a high-availability system: determine current availability; choose a high-availability architecture; modify/add application software for HA use; develop procedures for system failures; train administrators and operators; test procedures at regular intervals; document and monitor the current system state
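To make the idea of a deterministic error reaction concrete, a minimal sketch (purely illustrative, not from the book) might map each error class to exactly one predefined response, so that operator interaction during a failure stays minimal:

# Hypothetical error classes and responses, chosen only for illustration.
REACTIONS = {
    "link_down": "activate redundant link",
    "disk_failure": "switch to mirrored disk and open a service ticket",
    "server_unreachable": "fail over to the standby server in the cluster",
}

def react(error_class):
    # Unknown errors also trigger a single, defined fallback reaction.
    return REACTIONS.get(error_class, "page the on-call network support staff")

print(react("link_down"))        # activate redundant link
print(react("router_overheat"))  # page the on-call network support staff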
1.7
Summary
Mission-critical systems in today's enterprise networks are growing more dependent every day on smoothly functioning data processing systems. Network managers are thus faced with the enormous challenge of increasing the availability of their data processing infrastructures while these infrastructures grow in both size and complexity. Network management is further complicated by the fact that corporate intranets are increasingly accessible through remote or public networks, such as the Internet, telecommunication service providers, customers' networks, telecommuters' systems and so on. It is no longer possible to have complete, end-to-end control over a company network. This makes it even more important to plan network operation and maintenance systematically, to implement appropriate procedures, and to have experienced network support staff equipped with advanced diagnostic and management tools.