SOLUTIONS MANUAL TO ACCOMPANY HWANG

ADVANCED
COMPUTER
ARCHITECTURE
PARALLELISM
SCALABILITY
PROGRAMMABILITY

HWANG-CHENG WANG
University of Southern California

JUNG-GEN WU
National Taiwan Normal University

McGraw-Hill, Inc.
New York  St. Louis  San Francisco  Auckland  Bogota
Caracas  Lisbon  London  Madrid  Mexico City  Milan  Montreal
New Delhi  San Juan  Singapore  Sydney  Tokyo  Toronto

Solutions Manual to Accompany Hwang
ADVANCED COMPUTER ARCHITECTURE
Parallelism, Scalability, Programmability
Copyright © 1999 by McGraw-Hill, Inc. All rights reserved.
Printed in the United States of America. The contents, or
parts thereof, may be reproduced for use with
ADVANCED COMPUTER ARCHITECTURE
Parallelism, Scalability, Programmability
by Kai Hwang
provided such reproductions bear copyright notice, but may not
be reproduced in any form for any other purpose without
permission of the publisher.
ISBN 0-07-091623-6
2 3 4 5 6 7 8 9 0  HAM HAM  9 8 7 6 5

Contents

Foreword
Preface
Chapter 1    Parallel Computer Models
Chapter 2    Program and Network Properties
Chapter 3    Principles of Scalable Performance
Chapter 4    Processors and Memory Hierarchy
Chapter 5    Bus, Cache, and Shared Memory
Chapter 6    Pipelining and Superscalar Techniques
Chapter 7    Multiprocessors and Multicomputers
Chapter 8    Multivector and SIMD Computers
Chapter 9    Scalable, Multithreaded, and Dataflow Architectures
Chapter 10   Parallel Models, Languages, and Compilers
Chapter 11   Parallel Program Development and Environments
Chapter 12   UNIX, Mach, and OSF/1 for Parallel Computers
Bibliography

Foreword
Dr. Hwang-Cheng Wang and Dr. Jung-Gen Wu have produced this Solutions
Manual in a timely manner. I believe it will benefit many instructors using Advanced
Computer Architecture: Parallelism, Scalability, Programmability (ISBN 0-07-031622-8)
as a required textbook.

Drs. Wang and Wu have provided solutions to all the problems in the text. Some
of the solutions are unique and have been carefully worked out. Others contain just a
sketch of the underlying principles or computations involved. For such problems, they
have provided references which should help instructors find more information in
relevant sources.

The authors have done an excellent job in putting together the solutions. How-
ever, as with any scholarly work, there is always room for improvement. Therefore,
instructors are encouraged to communicate with us regarding possible refinements to the
solutions. Comments or errata can be sent to Kai Hwang at the University of South-
ern California. They will be incorporated in future printings of this Solutions Manual.
Sample test questions and solutions will also be included in the future to make it more
comprehensive.

Finally, I want to thank Dr. Wang and Dr. Wu and congratulate them on a difficult
job well done within such a short time period.

Kai Hwang
Preface
This Solutions Manual is intended for the exclusive use of instructors. Repro-
duction without permission is prohibited by copyright law.

The solutions in this Manual roughly fall into three categories:

  • For problem-solving questions, detailed solutions have been provided. In some
    cases alternative solutions are also discussed. More complete answers can be
    found in the text for definition-type questions.

  • For research-oriented questions, a summary of the ideas in key papers is pre-
    sented. Instructors are urged to consult the original and more recent publications
    in the literature.

  • For questions that require computer programming, algorithms or basic compu-
    tation steps are specified where appropriate. Example programs can often be
    obtained from on-line archives or libraries available at many research sites.

Equations and figures in the solutions are numbered separately from those in the
text. When an equation or a figure in the text is referenced, it is clearly indicated. Code
segments have been written in assembly and high-level languages. Most of the code should
be self-explanatory. Comments have been added in some places to help explain
the function performed by each instruction.

We have made a tremendous effort to ensure the correctness of the answers. But a
few errors might have gone undetected, and some factors might have been overlooked in
our analysis. Moreover, several questions are likely to have more than one valid solution;
solutions for research-oriented problems are especially sensitive to progress in related
areas. In light of this, we welcome suggestions and corrections from instructors.
Acknowledgments

We have received a great deal of help from our colleagues and experts during the
preparation of this Manual. Dr. Chi-Yuan Chin, Myungho Lee, Weihua Mao, Fong Pong,
Dr. Viktor Prasanna, and Shisheng Shang have contributed solutions to a number of
the problems. Chien-Ming Cheng, Cho-Chin Lin, Myungho Lee, Jih-Cheng Lin, Weihua
Mao, Fong Pong, Stanley Wang, and Namhoon Yoo have generously shared their ideas
through stimulating discussions. We are indebted to Dr. Bill Nitzberg and Dr. David
Black for providing useful information and pointing to additional references. Finally,
our foremost thanks go to Professor Kai Hwang for many insightful suggestions and
judicious guidance.

H.-C. Wang
J.-G. Wu

Chapter 1
Parallel Computer Models
Problem 1.1

The effective CPI is

    CPI = (45 x 1 + 32 x 2 + 15 x 2 + 8 x 2) / 100 = 1.55 cycles/instruction.

The MIPS rate is

    (40 x 10^6 cycles/s) / (1.55 cycles/instruction) = 25.8 x 10^6 instructions/s = 25.8 MIPS.

The execution time is

    (45000 x 1 + 32000 x 2 + 15000 x 2 + 8000 x 2) cycles / (40 x 10^6 cycles/s) = 3.875 ms.

The execution time can also be obtained by dividing the total number of instructions
by the MIPS rate:

    Execution time = (45000 + 32000 + 15000 + 8000) instructions / (25.8 x 10^6 instructions/s)
                   = 3.875 ms.
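As a quick check, the same numbers follow from a few lines of Python; the instruction
mix and cycle counts below are those given in the problem.

    # Instruction mix from Problem 1.1: (count, cycles per instruction)
    mix = [(45000, 1), (32000, 2), (15000, 2), (8000, 2)]
    clock_hz = 40e6                                    # 40-MHz processor

    total_instructions = sum(count for count, _ in mix)
    total_cycles = sum(count * cpi for count, cpi in mix)

    cpi = total_cycles / total_instructions           # 1.55 cycles/instruction
    mips = clock_hz / cpi / 1e6                       # 25.8 MIPS
    exec_time = total_cycles / clock_hz               # 3.875e-3 s

    print(cpi, mips, exec_time)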
Problem 1.2  Instruction set and compiler technology affect the length of the ex-
ecutable code and the memory access frequency. CPU implementation and control
determine the clock rate. Memory hierarchy impacts the effective memory access time.
These factors together determine the effective CPI, as explained in Section 1.1.4.
Problem 1.3

(a) The effective CPI of the processor is calculated as

    CPI = (15 x 10^6 cycles/s) / (10 x 10^6 instructions/s) = 1.5 cycles/instruction.

(b) The effective CPI of the new processor is

    (1 + 0.3 x 2 + 0.05 x 4) = 1.8 cycles/instruction.

Therefore, the MIPS rate is

    (30 x 10^6 cycles/s) / (1.8 cycles/instruction) = 16.7 MIPS.
Problem 1.4

(a) Average CPI = 1 x 0.6 + 2 x 0.18 + 4 x 0.12 + 8 x 0.1 = 2.24 cycles/instruction.

(b) MIPS rate = 40/2.24 = 17.86 MIPS.
Problem 1.5

(a) False. The fundamental idea of multiprogramming is to overlap the computations
    of some programs with the I/O operations of other programs.

(b) True. In an SIMD machine, all processors execute the same instruction at the same
    time. Hence it is easy to implement synchronization in hardware. In an MIMD
    machine, different processors may execute different instructions at the same time,
    and it is difficult to support synchronization in hardware.

(c) True. Interprocessor communication is facilitated by sharing variables on a mul-
    tiprocessor and by passing messages among nodes of a multicomputer. The mul-
    ticomputer approach is usually more difficult to program, since the programmer
    must pay attention to the actual distribution of data among the processors.

(d) False. In general, an MIMD machine executes different instruction streams on
    different processors.

(e) True. Contention among processors to access the shared memory may create hot
    spots, making multiprocessors less scalable than multicomputers.
Problem 1.6  The MIPS rates for different machine-program combinations are shown
in the following table:

                |                  Machine
    Program     | Computer A | Computer B | Computer C
    ------------+------------+------------+-----------
    Program 1   |    100     |    10      |     5
    Program 2   |    0.1     |     1      |     5
    Program 3   |    0.2     |     0.1    |     2
    Program 4   |    1       |     0.125  |     1
Various means of these values can be used to compare the relative performance of
the computers. Definitions of the means for a sequence of positive numbers a_1, a_2, ..., a_n
are summarized below. (See also the discussion in Section 3.1.2.)

(a) Arithmetic mean: AM = (Σ_{i=1}^{n} a_i) / n.

(b) Geometric mean: GM = (Π_{i=1}^{n} a_i)^{1/n}.

(c) Harmonic mean: HM = n / (Σ_{i=1}^{n} 1/a_i).

In general,

    AM ≥ GM ≥ HM.                                                        (1.1)
Based on the definitions, the following table of mean MIPS rates is obtained:

                     | Computer A | Computer B | Computer C
    -----------------+------------+------------+-----------
    Arithmetic mean  |    25.3    |    2.81    |    3.25
    Geometric mean   |    1.19    |    0.59    |    2.66
    Harmonic mean    |    0.25    |    0.20    |    2.1
Note that the arithmetic mean of the MIPS rates is proportional to the inverse of the
harmonic mean of the execution times. Likewise, the harmonic mean of the MIPS
rates is proportional to the inverse of the arithmetic mean of the execution times. The two
observations are consistent with Eq. 1.1.

If we use the harmonic mean of MIPS rates as the performance criterion (i.e., each
program is executed the same number of times on each computer), computer C has
the best performance. On the other hand, if the arithmetic mean of MIPS rates is
used, which is equivalent to allotting an equal amount of time to the execution of each
program on each computer (i.e., fast-running programs are executed more frequently),
then computer A is the best choice.
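The three means can be recomputed directly; the MIPS values below are the entries of
the preceding table.

    # MIPS ratings from the table in Problem 1.6 (rows: programs 1-4)
    ratings = {
        "A": [100, 0.1, 0.2, 1],
        "B": [10, 1, 0.1, 0.125],
        "C": [5, 5, 2, 1],
    }

    def arithmetic_mean(a):
        return sum(a) / len(a)

    def geometric_mean(a):
        prod = 1.0
        for x in a:
            prod *= x
        return prod ** (1 / len(a))

    def harmonic_mean(a):
        return len(a) / sum(1 / x for x in a)

    for machine, mips in ratings.items():
        print(machine, arithmetic_mean(mips), geometric_mean(mips), harmonic_mean(mips))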
Problem 1.7

  • An SIMD computer has a single control unit. The other processors are simple
    slave processors which accept instructions from the control unit and perform an
    identical operation at the same time on different data. Each processor in an
    MIMD computer has its own control unit and execution unit. At any moment,
    a processor can execute an instruction different from those of the other processors.

  • Multiprocessors have a shared memory structure. The degree of resource sharing
    is high, and interprocessor communication is carried out via shared variables in
    the shared memory. In multicomputers, each node typically consists of a pro-
    cessor and local memory. The nodes are connected by communication channels
    which provide the mechanism for message interchange among processors. Re-
    source sharing among processors is low.

  • In the UMA architecture, each memory location in the system is equally accessible
    to all processors, and the access time is uniform. In the NUMA architecture, the
    access time to a memory location depends on the proximity of a processor to
    the memory location; therefore, the access time is nonuniform. In the NORMA
    architecture, each processor has its own private memory; no memory is shared
    among processors, and each processor is allowed to access its private memory only.
    In the COMA architecture, such as that adopted by the KSR-1, each processor has
    its own private cache, and these caches together constitute the global address space
    of the system. It is like a NUMA with cache in place of memory. A page of data
    can be migrated to a processor upon demand or be replicated on more than one
    processor.
Problem 1.8

(a) The total number of cycles needed on a sequential processor is (4 + 4 + 8 + 4 + 2 +
    4) x 64 = 1664 cycles.

(b) Each PE executes the same instruction on the corresponding elements of the vectors
    involved. There is no communication among the processors. Hence the total
    number of cycles on each PE is 4 + 4 + 8 + 4 + 2 + 4 = 26.

(c) The speedup is 1664/26 = 64 with a perfectly parallel execution of the code.
Problem 1.9

Because the processing power of a CRCW-PRAM and an EREW-PRAM is the
same, we need only focus on memory accesses. Below, we show that the time com-
plexity of simulating a concurrent write or a concurrent read on an EREW-PRAM is
O(log n). In the proof, we assume it is known that an EREW-PRAM can sort n
numbers or write a number to n memory locations in O(log n) time.

(a) We present the proof for simulating concurrent writes below.

    1. Create an auxiliary array A of length n. When CRCW processor P_i, for i =
       0, 1, ..., n-1, desires to write a datum x_i to a location l_i, each corresponding
       EREW processor P_i writes the ordered pair (l_i, x_i) to location A[i]. These
       writes are exclusive, since each processor writes to a distinct memory location.

    2. Sort the array by the first coordinate of the ordered pairs in O(log n) time,
       which causes all data written to the same location to be brought together in
       the output.

    3. Each EREW processor P_i, for i = 1, 2, ..., n-1, now inspects A[i] = (l_j, x_j)
       and A[i-1] = (l_k, x_k), where j and k are values in the range 0 ≤ j, k ≤ n-1.
       If l_j ≠ l_k, or i = 0, then processor P_i writes the datum x_j to location l_j
       in global memory. Otherwise, the processor does nothing. Since the array A is
       sorted by the first coordinate, only one of the processors writing to any given
       location actually succeeds, and thus the write is exclusive.

    This process thus implements each step of concurrent writing in the common-
    CRCW model in O(log n) time.

(b) We present the proof for simulating concurrent reads as follows:

    1. Create an auxiliary array A of length n. When CRCW processor P_i, for
       i = 0, 1, ..., n-1, desires to read a datum from a location l_i, each corresponding
       EREW processor P_i writes the ordered three-tuple (i, l_i, x_i) to location A[i],
       in which x_i is an arbitrary number. These writes are exclusive, since each
       processor writes to a distinct memory location.

    2. Sort the array A by the second coordinate of the three-tuples in O(log n) time,
       which causes all requests to read from the same location to be brought together
       in the output.

    3. Each EREW processor P_i, for i = 1, 2, ..., n-1, now inspects A[i] = (j, l_j, x_j)
       and A[i-1] = (k, l_k, x_k), where j and k are values in the range 0 ≤ j, k ≤
       n-1. If l_j ≠ l_k, or i = 0, then processor P_i reads the datum from location l_j
       in global memory. Otherwise, the processor does nothing. Since the array A is
       sorted by the second coordinate, only one of the processors reading from any
       given location actually succeeds, and thus the read is exclusive.

    4. Each EREW processor P_i that read a datum stores the datum in the third
       coordinate of A[i], and then broadcasts it to the third coordinates of the
       following entries A[j], j = i+1, i+2, ..., that request the same location. This
       takes O(log n) time.

    5. Sort the array A by the first coordinate of the ordered three-tuples in O(log n)
       time.

    6. Each EREW processor P_i reads the datum in the third coordinate of A[i].
       These reads are exclusive, since each processor reads from a distinct memory
       location.

    This process thus implements each step of concurrent reading in the common-
    CRCW model in O(log n) time.
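A sequential Python sketch of steps 1-3 of the concurrent-write simulation may help
make the idea concrete; the parallel O(log n) sort of the EREW-PRAM is simply
modeled here by an ordinary sort, and the winner-selection rule is the one described in
step 3.

    # Sequential model of simulating a common-CRCW concurrent write on an EREW-PRAM.
    # requests[i] = (l_i, x_i): processor i wants to write datum x_i to location l_i.
    def simulate_concurrent_write(requests, memory):
        # Step 1: each processor writes its (location, datum) pair to A[i] (exclusive writes).
        A = list(requests)
        # Step 2: sort by location; on an EREW-PRAM this would be an O(log n) parallel sort.
        A.sort(key=lambda pair: pair[0])
        # Step 3: only the processor holding the first pair for each location performs
        # the actual write, so every memory location is written by at most one processor.
        for i, (loc, datum) in enumerate(A):
            if i == 0 or A[i - 1][0] != loc:
                memory[loc] = datum
        return memory

    mem = {}
    simulate_concurrent_write([(3, 'a'), (3, 'b'), (7, 'c')], mem)
    print(mem)   # one winner for location 3, plus location 7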
Problem 1.10  For multiplying two n-bit binary integers, there are 2n bits of input
and 2n bits of output.

Suppose the circuit in question, in the grid model, is a rectangle of height h and
width w as shown in the following diagram:

Assume without loss of generality that h ≤ w, and that there is at most one wire along
each grid line. It is possible to divide the circuit by a line as shown in the figure. This
line runs between the grid lines and runs vertically, except possibly for a single jog of
one grid unit. Most importantly, we can select the line so that at least 2n/3 of the
output bits (i.e., 1/3 of the output bits) are emitted on each side. We select the line
by sliding it from left to right, until the first point at which at least 2n/3 of the output
bits are output to the left of the line.

If no more than 4n/3 of these bits are output to the left, we are done. If not, start
from the top, considering places to jog the line back one unit to the left. We know that
if the line jogs at the very top, fewer than 4n/3 of the bits are emitted to the left, and
if the line jogs at the very bottom, more than 2n/3 are. Thus, as no single grid point
can be the place where as many as n/3 of the bits are emitted, we can find a suitable
place in the middle to jog the line. There, we shall have between 2n/3 and 4n/3 of the
output bits on each side.

Now assume without loss of generality that at least half of the input bits are read
on the left of the line, and let us, by renumbering bits if necessary, assume that these
include x_k, x_2k, .... Suppose also that output bits y_i1, y_i2, ..., y_i(2n/3) are output on
the right. We can pick values of the inputs so that y_i1 = x_k, y_i2 = x_2k, and so on. Thus
information regarding the 2n/3 input bits x_k, x_2k, ..., x_2kn/3 must cross the line.

We may assume at most one wire or circuit element along any grid line, so the
number of bits crossing the line in one time unit is at most h + 1 (h horizontal and one
vertical, at the jog). It follows that (h + 1)T ≥ 2n/3, or else the required 2n/3 bits
cannot cross the line in time T. Since we assume w ≥ h, we have both hT = Ω(n) and
wT = Ω(n). Since wh = A, we have AT^2 = Ω(n^2) by taking the product. That is,
AT^2 ≥ kn^2 for some constant k.
Problem 1.11

(a) Since the processing elements of an SIMD machine read and write data from dif-
    ferent memory modules synchronously, no access conflicts should arise. Thus any
    PRAM variant can be used to model SIMD machines.

(b) The processors in an MIMD machine can read the same memory location simul-
    taneously. However, simultaneous writing to the same memory location is prohibited.
    Thus the CREW-PRAM can best model an MIMD machine.
Problem 1.12

(a) The memory organization changed from the UMA model (global shared memory) to
    the NUMA model (distributed shared memory).

(b) The medium-grain multicomputers use hypercubes as their interconnection net-
    works, while the fine-grain multicomputers use lower-dimensional k-ary n-cubes
    (e.g., 2-D or 3-D torus) as their interconnection networks.

(c) In the register-to-register architecture, vector registers are used to hold vector
    operands and intermediate and final vector results. In the memory-to-memory archi-
    tecture, vector operands and results are retrieved directly from the main memory
    by using a vector stream unit.

(d) In a single-threaded architecture, each processor maintains a single thread of control
    with limited hardware resources. In a multithreaded architecture, each processor
    can execute multiple contexts by switching among threads.
Problem 1.13
Assume the input is A(i) for 0 ≤ i < n.
Problem 2.11

(a) To design a direct network for a 64-node multicomputer, we can use

    • A 3-D torus with 4 nodes along each dimension. The relevant parameters are:
      d = 6, D = 3⌊k/2⌋ = 6, and l = 3N = 192. Also, d x D x l = 6912.

    • A 6-dimensional hypercube. The relevant parameters are: d = 6,
      D = n = 6, and l = n x N/2 = 6 x 64/2 = 192. We have d x D x l = 6912.

    • A CCC with dimension k = 4. The relevant parameters are: d = 3, D =
      2k - 1 + ⌊k/2⌋ = 2 x 4 - 1 + ⌊4/2⌋ = 9, and l = 3N/2 = 96. The value of
      d x D x l is 2592.
(b)
    • If the quality of a network is measured by (d x D x l)^(-1), then a CCC is better
      than a 3-D torus or a 6-cube. A 3-D torus and a 6-cube have the same quality.

    • The torus and hypercube have similar network properties and are treated
      together. We have Σ = (1 + 6) x 6 / 2 = 21. Denote by P(i) the probability
      of information exchange between nodes at a distance i, and assume this
      probability decreases linearly with distance. Then we have

          P(1) = 6/21, P(2) = 5/21, P(3) = 4/21, P(4) = 3/21, P(5) = 2/21, P(6) = 1/21.

      Therefore, the mean internode distance is

          1 x 6/21 + 2 x 5/21 + 3 x 4/21 + 4 x 3/21 + 5 x 2/21 + 6 x 1/21 = 56/21 ≈ 2.67.

    • For the CCC, we have Σ = (1 + 9) x 9 / 2 = 45. The probabilities of inter-
      node communication for distance i are

          P(1) = 9/45, P(2) = 8/45, ..., P(9) = 1/45.

      Hence, the mean internode distance is

          1 x 9/45 + 2 x 8/45 + ... + 9 x 1/45 = 165/45 ≈ 3.67.

    In conclusion, the mean internode distance of the 4-CCC is greater than that of the
    6-cube and the 3-D torus. The 6-cube and the 3-D torus have identical mean internode
    distances. The similarity of the 6-cube and 3-D torus in the above is more than incidental.
    In fact, it has been shown [Wang88] that when k = 4 (as is the case for this problem),
    a k-ary n-cube is exactly a 2n-dimensional binary hypercube.
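The network parameters and the mean internode distances can be tabulated with a
short script; the linearly decreasing distance distribution assumed above is encoded in
mean_distance.

    # Problem 2.11(a): degree d, diameter D, link count l for three 64-node networks.
    N = 64
    networks = {
        "3-D torus (k=4)": dict(d=6, D=3 * (4 // 2), l=3 * N),
        "6-cube":          dict(d=6, D=6,            l=6 * N // 2),
        "4-CCC":           dict(d=3, D=2 * 4 - 1 + 4 // 2, l=3 * N // 2),
    }
    for name, p in networks.items():
        print(f"{name}: d={p['d']}, D={p['D']}, l={p['l']}, dxDxl={p['d'] * p['D'] * p['l']}")

    # Mean internode distance, assuming the probability of communication at
    # distance i decreases linearly with i: P(i) proportional to (D + 1 - i).
    def mean_distance(D):
        weights = [D + 1 - i for i in range(1, D + 1)]
        total = sum(weights)
        return sum(i * w for i, w in zip(range(1, D + 1), weights)) / total

    print(mean_distance(6))   # 6-cube / 3-D torus: 56/21, about 2.67
    print(mean_distance(9))   # 4-CCC: 165/45, about 3.67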
Problem 2.12

(a) It should be noted that we are looking for nodes that can be reached from N0 in
    exactly 3 steps. Therefore, nodes that can be reached in 1 or 2 steps have to be
    excluded.

    • For an 8 x 8 Illiac mesh, the reachable nodes can be calculated by the expression
      (a + b + c) mod 64, where a, b, and c can be +1, -1, +8, or -8. There are 20
      combinations (4 if a, b, and c are all different; 12 if two of them are equal; 4 if
      a = b = c). However, 8 of the combinations contain the pair +1 and -1 or the
      pair +8 and -8, making them reachable in one step. Such nodes have to be
      eliminated from the list. Hence, 12 nodes can be reached from N0 in three
      steps. The addresses of these nodes are 3, 6, 10, 15, 17, 24, 40, 47, 49, 54,
      58, and 61.

    • For a binary 6-cube, the binary address a5...a1a0 of a node reachable in three
      steps from N0 has exactly three 1s. There are 20 possible combinations
      (C(6,3)). The addresses of these nodes are 7, 11, 13, 14, 19, 21, 22, 25, 26,
      28, 35, 37, 38, 41, 42, 44, 49, 50, 52, and 56.

    • For the 64-node barrel shifter, the nodes reachable in exactly three steps can be
      determined as follows. List all 6-bit numbers which contain three 1s. There are
      20 such numbers. First take the 1's complement of each number and then add 1
      to each of the resulting numbers. (Equivalently, the new numbers are obtained
      by subtracting each of the original numbers from 64.) If a new number has three
      or four 1s in its binary representation and the 1s are separated by at least one 0,
      then both nodes whose addresses are the original number and the new number
      can be reached in exactly three steps. (The last point of the rule is due to
      the fact that clustered 1s can always be replaced by two 1s.) The addresses
      of these nodes are 11, 13, 19, 21, 22, 23, 25, 26, 27, 29, 35, 37, 38, 39, 41,
      42, 43, 45, 51, and 53.

(b) The upper bound on the minimum number of routing steps needed to send data
    from any node to another is 7 (= √64 - 1) for an 8 x 8 Illiac mesh, 6 for a 6-cube,
    and 3 (= log2(64)/2) for a 64-node barrel shifter.

(c) The upper bound on the minimum number of routing steps needed to send data
    from any node to another is 31 for a 32 x 32 Illiac mesh, 10 for a 10-cube, and 5
    for a 1024-node barrel shifter.
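The node lists in part (a) can be verified by a small breadth-first search; the move sets
below encode the Illiac mesh (±1, ±8 mod 64) and the 6-cube (single-bit complement).

    # Problem 2.12(a): nodes reachable from N0 in exactly three steps.
    def reachable_in_exactly(steps, moves, start=0, size=64):
        """Nodes whose minimum distance from start is exactly `steps`."""
        frontier, seen = {start}, {start}
        for _ in range(steps):
            frontier = {(node + m) % size for node in frontier for m in moves} - seen
            seen |= frontier
        return sorted(frontier)

    # 8 x 8 Illiac mesh: each node connects to nodes +/-1 and +/-8 (mod 64).
    print(reachable_in_exactly(3, [+1, -1, +8, -8]))

    # Binary 6-cube: neighbours differ in exactly one of the six address bits.
    def cube_reachable(steps, n_bits=6, start=0):
        frontier, seen = {start}, {start}
        for _ in range(steps):
            frontier = {node ^ (1 << b) for node in frontier for b in range(n_bits)} - seen
            seen |= frontier
        return sorted(frontier)

    print(cube_reachable(3))   # all addresses with exactly three 1-bits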
Problem 2.13  Part of Table 2.4 in the text is duplicated below:

    Network          | Bus               | Multistage          | Crossbar
    Characteristics  | System            | Network             | Switch
    -----------------+-------------------+---------------------+--------------------
    Minimum latency  | Constant          | O(log_k n)          | Constant
    for unit data    |                   |                     |
    transfer         |                   |                     |
    Bandwidth        | O(w/n) to O(w)    | O(w) to O(nw)       | O(w) to O(nw)
    per processor    |                   |                     |
    Wiring           | O(w)              | O(nw log_k n)       | O(n^2 w)
    complexity       |                   |                     |
    Switching        | O(n)              | O(n log_k n)        | O(n^2)
    complexity       |                   |                     |
    Connectivity     | Only one to one   | Some permutations   | All permutations,
    and routing      | at a time.        | and broadcast, if   | one at a time.
    capability       |                   | network unblocked.  |
    Remarks          | Assume n proces-  | n x n MIN           | Assume n x n
                     | sors on the bus;  | using k x k         | crossbar with
                     | bus width is w    | switches with line  | line width of
                     | bits.             | width of w bits.    | w bits.
Problem 2.14

(a) For each output terminal, there are 4 possible connections (one from each of the
    input terminals), so there are 4 x 4 x 4 x 4 = 256 legitimate states.

(b) 48 (= 16 x 3) 4 x 4 switch modules are needed to construct a 64-input Omega
    network. There are 24 (= 4 x 3 x 2 x 1) permutation connections in a 4 x 4 switch
    module. Therefore a total of 24^48 permutations can be implemented in a single
    pass through the network without blocking.

(c) The total number of permutations of 64 inputs is 64!. So the fraction is 24^48/64!
    ≈ 1.4 x 10^-23.
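A few lines of Python reproduce the counts in (b) and (c).

    # Problem 2.14: fraction of the 64! permutations realizable in one pass
    # through a 64-input Omega network built from 4 x 4 switch modules.
    from math import factorial, log2

    n = 64
    switch_size = 4
    stages = int(log2(n) / log2(switch_size))         # 3 stages
    modules = stages * (n // switch_size)             # 48 modules
    perms_per_module = factorial(switch_size)         # 24 one-to-one settings

    single_pass = perms_per_module ** modules         # 24**48
    fraction = single_pass / factorial(n)
    print(modules, single_pass, f"{fraction:.2e}")    # 48, 24**48, about 1.4e-23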
Problem 2.15

(a) We label the switch modules of a 16 x 16 Baseline network as shown in the figure
    below. Then, by changing the positions of some of the switch modules, the Baseline
    network becomes an Omega network.

    [Figures: the labeled 16 x 16 Baseline network and the equivalent Omega network
    obtained after repositioning the switch modules.]

(b) If we change the positions of some switch modules in the Baseline network in a
    different way, it becomes the Flip network.

    [Figure: the Baseline network redrawn as a Flip network.]

(c) Since both the Omega network and the Flip network are topologically equivalent
    to the Baseline network, they are topologically equivalent to each other.
Problem 2.16

(a) k^n.

(b) n⌊k/2⌋.

(c) 2k^(n-1).

(d) 2n.

(e)
    • A k-ary 1-cube is a ring with k nodes.
    • A k-ary 2-cube is a 2-D k x k torus.
    • A mesh is a torus without end-around connections.
    • A 2-ary n-cube is a binary n-cube.
    • An Omega network is the multistage network implementation of a shuffle-
      exchange network. Its switch modules can be repositioned to have the same
      interconnection topology as a binary n-cube.

(f) The conventional torus has long end-around connections, but the folded torus has
    equal-length connections. (See Figure 2.21 in the text.)

(g)
    • The relation

          B = 2wN/k

      will be shown in the solution of Problem 2.18. Therefore, if both the number
      of nodes N and the wire bisection width B are constants, the channel width w
      will be proportional to k:

          w = B/(2N/k) = Bk/(2N).

    • The latency of a wormhole-routed network is inversely proportional to the
      channel width w, hence also inversely proportional to k. This means a network
      with a higher k will have lower latency. For two k-ary n-cube networks with
      the same number of nodes, the one with a lower dimension has a larger k, and
      hence a lower latency.

    • It will be shown in the solution of Problem 2.18 that the hot-spot throughput
      is equal to the bandwidth of a single channel:

          Θ_HS = w = kB/(2N).

      Low-dimensional networks have a larger k, hence a higher hot-spot through-
      put.
Problem 2.17

(a) In a tree network, a message going from processor i to processor j goes up the
    tree to their least common ancestor and then back down according to the least
    significant bits of j. Message traffic through lower-level (closer to the root) nodes
    is heavier than that of higher-level nodes. The lower-level channels in a fat tree
    have a greater number of wires, and hence a higher bandwidth. This will prevent
    congestion in the lower-level channels.

(b) The capacity of a universal fat tree at level k is

        c_k = min(⌈n/2^k⌉, ⌈w/2^(2k/3)⌉).

    • If k ≥ 3 log(n/w), then ⌈n/2^k⌉ ≤ ⌈w/2^(2k/3)⌉. Therefore, c_k = ⌈n/2^k⌉,
      which is 1, 2, 4, ..., for k = log(n + 1), log(n + 1) - 1, log(n + 1) - 2, ....

    • If k < 3 log(n/w), then ⌈n/2^k⌉ > ⌈w/2^(2k/3)⌉. Hence c_k = ⌈w/2^(2k/3)⌉,
      which is w, w/2^(2/3), w/2^(4/3), w/4, ..., for k = 0, 1, 2, 3, ....

    • Initially, the capacities double from one level to the next toward the root,
      but at levels less than 3 log(n/w) away from the root, the channel capacities
      grow at the slower rate of 2^(2/3) per level.
Problem 2.18

(a) A k-ary n-cube network has N nodes, where N = k^n. Assume k is even. If the
    network is partitioned along one dimension into two parts of equal size, the "cross
    section" separating the two parts is of size N/k. Corresponding to each node in
    the cross section, there are two wires, one being the nearest-neighbor link and the
    other the wraparound link in the original network. Therefore, the cross section
    contains 2N/k wires, each w bits wide, giving a wire bisection width B = 2wN/k.
    The argument also holds for k odd, although the partitioning is slightly more
    complex.

(b) The hot-spot throughput of a network is the maximum rate at which messages
    can be sent from one specific node P_i to another specific node P_j. For a k-ary
    n-cube with deterministic routing, the hot-spot throughput, Θ_HS, is equal to the
    bandwidth of a single channel w. From (a), w = kB/(2N). Therefore,

        Θ_HS = kB/(2N),

    which is proportional to k for a fixed B.
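A small numerical illustration (with made-up example sizes, not taken from the
problem): holding the wire bisection B fixed, the lower-dimensional network has the
wider channel and hence the higher hot-spot throughput.

    # Problem 2.18: wire bisection width and hot-spot throughput of a k-ary n-cube.
    def bisection_width(N, k, w):
        """B = 2wN/k: N/k nodes in the cross section, two w-bit links per node."""
        return 2 * w * N // k

    def hot_spot_throughput(N, k, B):
        """Theta_HS equals the width of a single channel, w = kB/(2N)."""
        return k * B / (2 * N)

    # Illustration: 256 nodes arranged as a 16-ary 2-cube and as a 4-ary 4-cube.
    print(bisection_width(N=256, k=16, w=16))          # 512 wires with 16-bit channels
    for k, n in [(16, 2), (4, 4)]:
        N = k ** n
        # Fix the wire bisection at B = 512; larger k gives wider channels.
        print(k, n, hot_spot_throughput(N, k, B=512))  # 16.0 and 4.0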
Problem 2.19

(a) Embedding of an r x r torus in a hypercube is shown in the following diagrams
    for r = 2 and 4, respectively ((a) and (c)). As can be seen, if the nodes of a torus
    are numbered properly, we obtain internode connections identical to those of a
    hypercube (nodes whose numbers differ by a power of 2 are linked directly).

    A 2r x 2r torus can be constructed from r x r tori in two steps. In step one,
    a 2r x r torus is built by combining an r x r torus with its "mirror" image (in the
    sense of node numbering) and connecting the corresponding nodes, as shown in
    diagram (b). In step two, the 2r x r torus is combined with its mirror image to form
    a 2r x 2r torus. In this manner, a torus can be fully embedded in a hypercube of
    dimension d with 2^d = r^2 nodes.

    In general, it has been shown that any m1 x m2 x ... x mj torus, where m_i = 2^p_i,
    can be embedded in a hypercube of dimension d = p1 + p2 + ... + pj with the
    proximity property preserved, using a binary reflected Gray code for the mapping
    [Chan86].
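A short sketch of the binary reflected Gray code mapping mentioned above; it checks
the proximity property for a 4 x 4 torus embedded in a 4-cube. The helper names are
illustrative, not taken from [Chan86].

    # Problem 2.19(a): embedding an r x r torus in a hypercube using the binary
    # reflected Gray code, so that torus neighbours map to hypercube neighbours.
    def gray(i):
        return i ^ (i >> 1)

    def torus_to_hypercube(r):
        """Map torus node (x, y), 0 <= x, y < r, to a hypercube address (r a power of two)."""
        bits = r.bit_length() - 1
        return {(x, y): (gray(x) << bits) | gray(y) for x in range(r) for y in range(r)}

    # Check the proximity property for r = 4: wrap-around neighbours differ in one bit.
    r = 4
    m = torus_to_hypercube(r)
    for (x, y), addr in m.items():
        for nx, ny in [((x + 1) % r, y), (x, (y + 1) % r)]:
            assert bin(addr ^ m[(nx, ny)]).count("1") == 1
    print("4 x 4 torus embeds in a 4-cube with dilation 1")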
(b) Embedding of a ring on a CCC is equivalent to finding a Hamiltonian cycle on
    the CCC. In the following figure, the embedding of rings on CCCs for k = 3 and
    4, respectively, is shown. It is easy to first consider the embedding of a ring on
    a binary hypercube by treating the cycle at each vertex of the hypercube as a
    supernode. This step can be carried out easily, and there are several possible ways
    to embed a ring on a hypercube. Then, whenever a supernode on the embedded
    ring is visited, all the nodes in the corresponding cycle are linked.

(c) Embedding of a complete balanced tree in a mesh is shown in the following diagram
    for trees of different heights. In general, the root of a tree is mapped to the center
    node of a mesh, and leaf nodes are mapped to outlying mesh nodes. The process
    is recursive. Suppose a tree of height l ≥ 3 has been embedded in an r x r mesh.
    When embedding a tree of height l + 1, an (r + 2) x (r + 2) mesh is needed, with
    the new leaf nodes mapped to the boundary nodes of the augmented mesh. This
    is illustrated for l = 3.
Problem 2.20

(a) A hypernet combines the hierarchical structure of a tree in the overall architecture
    and the uniform connectivity of a hypercube in its building blocks.

(b) By construction, a hypernet built with identical modules (buslets, treelets, cubelets,
    etc.) has a constant node degree. This is achieved by a systematic use of the ex-
    ternal links of each cubelet when building larger and larger systems.

(c) The hypernet architecture was proposed to take advantage of localized communi-
    cation patterns present in some applications such as connectionist neural networks.
    The connection structure of hypernets gives effective support for communication
    between adjacent lower-level clusters. Global communication is also supported,
    but the bandwidth provided is lower. Algorithms with commensurate nonuniform
    communication requirements among different components are suitable candidates
    for implementation on hypernets.

    I/O capability of a hypernet is furnished by the external links of each building
    block. As a result, I/O devices can be spread throughout the hierarchy to meet I/O
    demand. Fault tolerance is built into the hypernet architecture to allow graceful
    degradation. Execution of a program can be switched to a subnet in case of node
    or link failures. The modular construction also facilitates isolation and subsequent
    replacement of faulty nodes or subnets.

Chapter 3
Principles of Scalable Performance
Problem 3.1

(a) CPI = 1 x 0.6 + 2 x 0.18 + 4 x 0.12 + 12 x 0.1 = 2.64 cycles/instruction.

(b) MIPS rate = (4 x 40 x 10^6 cycles/s) / (2.64 cycles/instruction) = 60.60 MIPS.

(c) When a single processor is used, the execution time is t1 = 200000/17.86 = 1.12 x 10^4
    μs. When four processors are used, the time is reduced to t4 = 220000/60.60 =
    3.63 x 10^3 μs. Hence the speedup is 11.2/3.63 = 3.08 and the efficiency is 3.08/4 =
    0.77.
Problem 3.2

(a) If the vector mode is not used at all, the execution time will be

        0.75T + 9 x 0.25T = 3T.

    Therefore, the effective speedup is 3T/T = 3. Let the fraction of vectorized code
    be α. Then α = 9 x 0.25T / 3T = 0.75.

(b) Suppose the speed ratio between the vector mode and the scalar mode is doubled.
    The execution time becomes

        0.75T + 0.25T/2 = 0.875T.

    The effective speedup is 3T/0.875T = 24/7 = 3.43.

(c) Suppose the speed for vector mode computation is still nine times as fast as that
    for scalar mode. To maintain the effective speedup of 3.43, the vectorization ratio
    α must satisfy the following relation:

        3T / ((1 - α) x 3T + α x 3T/9) = 24/7.

    Solving the equation, we obtain α = 51/64 ≈ 0.8.
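The three answers follow from the usual Amdahl-style speedup expression; a few lines
of Python reproduce them.

    # Problem 3.2: speedup from partial vectorization.
    def speedup(alpha, ratio):
        """alpha: vectorized fraction of the work; ratio: vector/scalar speed ratio."""
        return 1 / ((1 - alpha) + alpha / ratio)

    print(speedup(0.75, 9))        # (a) 3.0
    print(speedup(0.75, 18))       # (b) 24/7, about 3.43

    # (c) solve for alpha giving speedup 24/7 when the ratio stays at 9:
    target = 24 / 7
    alpha = (1 - 1 / target) / (1 - 1 / 9)     # 51/64, about 0.8
    print(alpha)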
Problem 3.3

(a) Suppose the total workload is W million instructions. Then the execution time in
    seconds is

        T = αW/(nR) + (1 - α)W/R,

    where R is the MIPS rate of a single processor. Therefore, the effective MIPS rate
    is

        W/T = nR / (α + n(1 - α)) = nR / (n - (n - 1)α).

(b) Substituting the given data (n = 16) into the expression in (a), we obtain an
    equation whose denominator is 16 - 15α; solving it gives α = 24/25 = 0.96.
Problem 3.4  Assume the speed in enhanced mode is n times as fast as that in regular
mode. The execution time, as a function of the fraction a of the workload executed in
regular mode, is

    T(a) = a/R + (1 - a)/(nR),

where R is the execution rate in regular mode.

(a) If a varies linearly between a and b, the average execution time is

        T_avg = ∫ T(a) da / (b - a) = [(n - 1)(b + a) + 2] / (2nR).

    The average execution rate is

        R_avg = 1/T_avg = 2nR / [(n - 1)(b + a) + 2],

    and the average speedup factor is

        S_avg = R_avg / R = 2n / [(n - 1)(b + a) + 2].

(b) If a → 0 and b → 1, then

        S_avg = 2n / (n + 1).
Problem 3.5

(a) The harmonic mean execution rate in MIPS is

        R_h = 1 / (Σ_{i=1}^{4} f_i / R_i).

    The arithmetic mean execution time is

        T_a = Σ_{i=1}^{4} f_i / R_i = 1 / R_h.

(b) Given f1 = 0.4, f2 = 0.3, f3 = 0.2, f4 = 0.1, and R1 = 4 MIPS, R2 = 8 MIPS,
    R3 = 11 MIPS, R4 = 15 MIPS, the arithmetic mean execution time is T =
    0.4/4 + 0.3/8 + 0.2/11 + 0.1/15 = 0.162 μs per instruction.

    Several factors cause R_i to be smaller than 5i. First, there might be memory
    access operations which take extra machine cycles. Second, when the number of
    processors is increased, more memory access conflicts arise, which increase the ex-
    ecution time and lower the effective MIPS rate. Third, part of the program may
    have to be executed sequentially or can be executed by only a limited number of
    processors simultaneously. Finally, there is an overhead for processors to synchro-
    nize with each other. Because of these overheads, R_i/i typically decreases with i.

(c) Given a new distribution f1 = 0.1, f2 = 0.2, f3 = 0.3, and f4 = 0.4 due to the
    use of an intelligent compiler, the arithmetic mean execution time becomes T =
    0.1/4 + 0.2/8 + 0.3/11 + 0.4/15 = 0.104 μs per instruction.
Problem 3.6  Amdahl's law is based on a fixed workload, where the problem size is
fixed regardless of the machine size. Gustafson's law is based on a scaled workload,
where the problem size is increased with the machine size so that the solution time is
the same for sequential and parallel executions. Sun and Ni's law also applies to
scaled problems, where the problem size is increased to match the maximum memory
capacity.
Problem 3.7

(a) The total number of clock cycles needed is

        Σ_{I=1}^{1024} (2 + 2I) = 2 x 1024 + 1024 x 1025 = 1,051,648.

(b) If consecutive outer-loop iterations are assigned to a single processor, the workload
    is not balanced and the parallel execution time is dominated by that on processor
    32. The number of clock cycles needed is

        Σ_{I=993}^{1024} (2 + 2I) = 2 x 32 + (993 + 1024) x 32 = 64,608.

    The speedup is

        1051648 / 64608 = 16.28.

(c) To balance the load, we divide the outer loop into 64 chunks, each consisting of 16
    iterations. Each processor is allocated a pair of chunks in a fold-over manner. That
    is, processor 1 is allocated the first and the last chunks, processor 2 the second and
    the second-to-last chunks, and so on. Thus, we have the following modified code:

        Doall L = 1, 32
           Do 10 I = (L-1) * 16 + 1, L * 16
              SUM(I) = 0
              Do 20 J = 1, I
        20    SUM(I) = SUM(I) + J
        10 Continue
           Do 30 I = (64-L) * 16 + 1, (64-L+1) * 16
              SUM(I) = 0
              Do 40 J = 1, I
        40    SUM(I) = SUM(I) + J
        30 Continue
        Endall

(d) Suppose the overhead associated with flow control is neglected. The number of
    cycles required for the computation on processor L, 1 ≤ L ≤ 32, is

        Q_L = Σ_{I=(L-1)x16+1}^{Lx16} (2 + 2I) + Σ_{I=(64-L)x16+1}^{(64-L+1)x16} (2 + 2I)
            = {[(L-1) x 16 + 2 + L x 16 + 1] + [(64-L) x 16 + 2 + (64-L+1) x 16 + 1]} x 16
            = 2054 x 16 = 32,864.

    The speedup in this case is

        1051648 / 32864 = 32.
Problem 3.8

(a) An example program is shown below. Assume α, β, γ are the base addresses of
    A, B, C, respectively, which point to the first element of the individual arrays.
    Also assume only a small number of registers are available. The notation M(addr)
    stands for the value stored in memory location addr.

           Mov   R1, 0           ; Initialize R1 = index i
           Mov   R5, 0           ; Initialize R5 = i x n
           Mov   R7, 0           ; Initialize R7 = offset of C_ij
    Loop1: Mov   R2, 0           ; Reset R2 = index j
    Loop2: Mov   R3, 0           ; Reset R3 = index k
           Mov   R4, 0           ; Reset R4 = k x n
           Mov   R6, R5          ; R6 = i x n
           Mov   R11, 0          ; R11 = value of C_ij
    Loop3: Add   R4, R2          ; Compute offset for B_kj
           Load  R9, M(R4 + β)   ; Fetch B_kj
           Load  R10, M(R6 + α)  ; Fetch A_ik
           Mul   R10, R9         ; A_ik x B_kj
           Add   R11, R10        ; Update C_ij
           Inc   R3              ; Increment k
           Inc   R6              ; Increment offset for A_ik
           Add   R4, n           ; Compute k x n
           Cmp   R3, n           ; Check limit for k
           Jnz   Loop3           ; Loop until limit is reached
           Store M(R7 + γ), R11  ; Store value of C_ij
           Inc   R2              ; Increment j
           Inc   R7              ; Increment offset for C_ij
           Cmp   R2, n           ; Check limit for j
           Jnz   Loop2           ; Loop until limit is reached
           Inc   R1              ; Increment i
           Add   R5, n           ; Compute i x n
           Cmp   R1, n           ; Check limit for i
           Jnz   Loop1           ; Loop until limit is reached

(b) From the above code, the number of instructions is

        I = 10n^3 + 9n^2 + 5n + 3.

    For the timing analysis, the following numbers of cycles are assumed for the
    different types of instructions:

        • Add, Mul, Cmp, Jnz: 2 cycles.
        • Load, Store: 4 cycles.
        • Mov, Inc: 1 cycle.

    Based on the above assumptions, we obtain the following serial execution time:

        T1 = (22n^3 + 14n^2 + 8n + 3) cycles.

    The average number of cycles per instruction can be calculated as

        CPI = T1 / I = (22n^3 + 14n^2 + 8n + 3) / (10n^3 + 9n^2 + 5n + 3) cycles/instruction.

    Asymptotically, the CPI is close to 2.2 when n is large.

(c) If the clock rate is 40 MHz, a rough estimate of the MIPS rate is

        40 / 2.2 ≈ 18 MIPS.
(d) Matrix A is partitioned into blocks by row and matrix B by column, as shown in
    the following diagram:

    [Figure: A partitioned into N row blocks A_1, ..., A_N, B into N column blocks
    B_1, ..., B_N, and the product C into N x N subblocks C_ij.]

    Each block A_i represents the rows (i-1)n/N + 1 through in/N of A.
    Similar notations are used for the block submatrices of B. The multiplication of one
    A block of size (n/N) x n and a B block of size n x (n/N) yields one subblock of
    size (n/N) x (n/N) in the product matrix C.

    The amount of time required for the computations in each processor is 22(n/N)^2 n +
    14(n/N)n + 8n/N + 3 cycles, provided each processor is identical to the uniprocessor
    used in (a) and memory access conflicts are ignored. Each processor needs to compute
    N such subblocks. Thus, the total parallel execution time is (22n^3/N + 14n^2 + 8n + 3N)
    cycles. The potential speedup is (22n^3 + 14n^2 + 8n + 3) / (22n^3/N + 14n^2 + 8n + 3N)
    ≈ N when n is large.

(e) The matrix is partitioned as in part (d). Initially, node i has submatrices A_i and
    B_i for i = 1, ..., N. In the first step, each node computes a subblock C_ii of matrix
    C. After that, the nodes exchange subblocks of B in the following manner: node 1
    sends its B block to node 2, node 2 sends its B block to node 3, ..., node N sends
    its B block to node 1. Then each node i computes the subblock C_{i,i-1}, except
    node 1, which computes C_{1,N}. The process is repeated until all the subblocks of
    C are computed in N steps. If the initial distribution of B to the nodes is not
    counted, the number of message-passing steps is N - 1.

    The sequence of computations is illustrated in the following diagram for N = 4,
    with different shades indicating subblocks computed in different steps. Each block
    corresponds to a C subblock of size (n/4) x (n/4).
(f) Assume each message consists of a single element of matrix B, which is 8 bytes for
    double-precision floating-point numbers. The message-sending operation for node
    i in step j (1 ≤ j ≤ N-1) can be specified as follows:

        /* Process i sending messages in step j                        */
        /* jj is the index of the B block currently held by node i    */
        jj = i - j + 1
        if (jj < 1) jj = jj + N
        for k = (jj - 1) * n/N + 1 to jj * n/N do
            for l = 1 to n do
                if (i == N) send(1, B(l, k), 8);
                else send(i + 1, B(l, k), 8);
            enddo
        enddo

    The first parameter of the send instruction is the destination node ID, the
    second is the element of B to be transmitted, and the third is the length of the
    message. There should also be code in the node receiving the message; it is similar
    to the sending counterpart. For simplicity this code is not shown. By symmetry,
    each node must execute both sending and receiving instructions.
    The execution time can be divided into the time for arithmetic operations
    (t_a) and that for communication (t_c), assuming there is no overlap between the
    two types of operations. The time for arithmetic operations is identical to that on
    a shared-memory multiprocessor, which is t_a = 22n^3/N + 14n^2 + 8n + 3N. The
    total number of message-passing operations is (N - 1) x (n/N) x n. Thus, the total
    execution time is

        (22n^3/N + 14n^2 + 8n + 3N + 100(N - 1) x (n/N) x n) cycles.

    Therefore, the speedup is

        (22n^3 + 14n^2 + 8n + 3) / (22n^3/N + 14n^2 + 8n + 3N + 100(N - 1) x (n/N) x n).

    Note that different assumptions for this problem will lead to different speedup
    results. It is also possible to use other matrix multiplication algorithms such as
    those described in the text.
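Under the cycle-count models derived above (and treating the 100-cycle message cost
as given), the speedups can be tabulated; the particular values of n and N below are
illustrative only.

    # Problem 3.8: serial and parallel cycle counts for n x n matrix multiplication.
    def t_serial(n):
        return 22 * n**3 + 14 * n**2 + 8 * n + 3

    def t_shared_memory(n, N):
        # N processors, each computing N subblocks of size (n/N) x (n/N)
        return 22 * n**3 / N + 14 * n**2 + 8 * n + 3 * N

    def t_message_passing(n, N, msg_cycles=100):
        # add (N - 1) * (n/N) * n message-passing operations of ~100 cycles each
        return t_shared_memory(n, N) + msg_cycles * (N - 1) * (n / N) * n

    N = 16
    for n in (64, 256, 1024):
        print(n, t_serial(n) / t_shared_memory(n, N), t_serial(n) / t_message_passing(n, N))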
Problem 3.9

(a) The arithmetic mean execution time of each machine is calculated as follows:

    • Machine A: (1 + 1000 + 500 + 100)/4 = 400.25 s.
    • Machine B: (10 + 100 + 1000 + 800)/4 = 477.5 s.
    • Machine C: (20 + 20 + 50 + 100)/4 = 47.5 s.

(b) Harmonic mean MIPS rates:

    • Machine A: 100/400.25 = 0.25 MIPS.
    • Machine B: 100/477.5 = 0.21 MIPS.
    • Machine C: 100/47.5 = 2.1 MIPS.

(c) In terms of harmonic mean execution rate, Machine C is higher than Machine A,
    which is in turn higher than Machine B. See also the discussion in Problem 1.6.
Problem 3.10

(a) The total execution time in serial execution is

        T(1) = Σ_i W_i.

    In parallel execution with n processors, the execution time is

        T(n) = Σ_i W_i / i.

    Therefore, the speedup for the fixed-memory model is S*_n = T*(1)/T*(n), where the
    asterisk denotes the scaled workload. If (i) W_i = 0 for i ≠ 1 and i ≠ n, (ii) W*_1 =
    W_1, and (iii) W*_n = G(n) W_n, then

        S*_n = (W_1 + G(n) W_n) / (W_1 + G(n) W_n / n).                  (3.1)

(b) When G(n) = 1, i.e., the problem size remains fixed when the memory size is
    increased, Amdahl's law is obtained:

        S_n = (W_1 + W_n) / (W_1 + W_n / n).

(c) When G(n) = n, i.e., the problem size increases in direct proportion to the memory-
    size increase (which in turn is proportional to the number of processors), Gustafson's
    law is obtained:

        S'_n = (W_1 + n W_n) / (W_1 + W_n).                              (3.2)

(d) Let W_1 = a, W_n = 1 - a. The relation S_n ≤ S'_n follows from the definitions:

        S_n = (a + (1 - a)) / (a + (1 - a)/n) ≤ a + n(1 - a) = S'_n.

    We now show S'_n ≤ S*_n. Assume G(n) = g ≥ n.
    Let β = 1 - a. Eqs. 3.1 and 3.2 can be rewritten as

        S'_n = (a + nβ) / (a + β)                                        (3.3)

    and

        S*_n = (a + gβ) / (a + gβ/n).                                    (3.4)

    Consider three different cases:

    1. a = 1, β = 0: S'_n = S*_n = 1.

    2. a = 0, β = 1: S'_n = S*_n = n.

    3. 0 < a < 1 and g ≥ n: comparing Eqs. 3.3 and 3.4 shows that S*_n ≥ S'_n in this
       case as well,

    which completes the proof.
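The ordering S_n ≤ S'_n ≤ S*_n can also be observed numerically; the values a = 0.2,
n = 64, and G(n) = n^1.5 below are arbitrary illustrative choices.

    # Problem 3.10: the three speedup models, with W_1 = a and W_n = 1 - a.
    def fixed_load(a, n):            # Amdahl's law, G(n) = 1
        return 1 / (a + (1 - a) / n)

    def scaled_time(a, n):           # Gustafson's law, G(n) = n
        return a + n * (1 - a)

    def scaled_memory(a, n, G):      # Sun and Ni's memory-bounded model
        return (a + G * (1 - a)) / (a + G * (1 - a) / n)

    a, n = 0.2, 64
    print(fixed_load(a, n), scaled_time(a, n), scaled_memory(a, n, G=n**1.5))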
Problem 3.11  In a reasonable execution environment, the workload and execution
times should satisfy the following conditions:

1. At most n instructions can be executed by n processors simultaneously:

       T(n) ≤ O(n) ≤ nT(n).                                              (3.5)

2. O(n) should be at most n times as large as O(1):

       O(1) ≤ O(n) ≤ nO(1).                                              (3.6)

(a) Since U(n) = R(n)E(n), we have E(n) ≤ U(n) by Eq. 3.6. From Eq. 3.5, we have
    U(n) = O(n)/(nT(n)) ≤ 1, and hence R(n) ≤ 1/E(n). The proof is completed by
    combining the inequalities.

(b) The result is obtained by combining Eqs. 3.6, 3.9, and 3.10.

(c)

       Q(n) = S(n)E(n)/R(n)
            = [T(1)/T(n)] x [T(1)/(nT(n))] x [O(1)/O(n)]
            = T^3(1) / (n T^2(n) O(n))
            ≤ T(1)/T(n) = S(n)          (from Eq. 3.8).

(d) The following inequalities can easily be shown to hold for n > 1:

       1/n ≤ (n + 3)/(4n) ≤ (n + 3)(n + log2 n)/(4n^2) ≤ 1,

       1 ≤ (n + log2 n)/n ≤ 4n/(n + 3).
    R_a = 0.8 x 10 + 0.2 R_2 = 8 + 0.2 R_2 Mflops.
    R_g = 10^0.8 x R_2^0.2 Mflops.
    R_h = 1/(0.8/10 + 0.2/R_2) = 10 R_2 / (0.8 R_2 + 2) Mflops.

    R_a = 0.2 x 10 + 0.8 R_2 = 2 + 0.8 R_2 Mflops.
    R_g = 10^0.2 x R_2^0.8 Mflops.
    R_h = 1/(0.2/10 + 0.8/R_2) = 10 R_2 / (0.2 R_2 + 8) Mflops.

    [Figure: the three mean rates plotted as functions of R_2.]
(d) Suppose the harmonic mean MIPS rate is used as the criterion to compare the
    relative performance of the three machines. For machine 1, the value is

        R_h^(1) = 1 / (f_1/100 + f_2/0.1) = 100 / (1000 - 999 f_1).

    Similar expressions can be written for machines 2 and 3. We can plot R_h^(1),
    R_h^(2), and R_h^(3) as functions of f_1. The following diagram shows the variation
    of the harmonic mean MIPS rate for the three machines with respect to f_1.

    [Figure: harmonic mean MIPS rate of the three machines versus f_1.]

    It is seen that R_h^(1) is very sensitive to the value of f_1, R_h^(2) varies slowly
    with f_1, and R_h^(3) remains constant, independent of the value of f_1. For most
    values of f_1, R_h^(3) has a larger value than R_h^(1) and R_h^(2). But when f_1 is
    close to 1, R_h^(1) and R_h^(2) surpass R_h^(3). A large value of f_1 means that most
    of the time is spent on the high-MIPS benchmark for machines 1 and 2, leading to
    a high harmonic mean MIPS rate for the two machines.
Problem 3.14  The communication cost of a data exchange between two directly
connected nodes is modeled by α + βm, where α is the time required to set up a
channel, β is the time to transmit one word over the communication channel, and m is
the amount of data exchanged.

(a) In the Fox-Otto-Hey algorithm, each processor in a √n x √n torus is assigned s^2/n
    elements of matrices A and B. The algorithm requires a total of √n iterations.
    During each iteration, a subblock of A is broadcast along the row of the torus.
    Therefore, in each iteration, the time taken for communication is √n(α + s^2 β/n).
    Since √n iterations are required, the total time for communication is (nα + s^2 β).
    However, if the matrix subblocks are sent in a pipelined fashion (such as wormhole
    routing), the second term is reduced by a factor of 2/√n (for details, please see
    [Fox87]), resulting in (nα + 2s^2 β/√n) for the communication overhead.

    If the torus is embedded in a hypercube and a sophisticated one-to-all broad-
    cast scheme such as the one in [Johnsson89] is used, the communication time per
    iteration can be further reduced to (log n)α + 2s^2 β/√n. Therefore the total
    communication overhead on n processors is √n (log n)α + 2s^2 β.
(b) Berntsen’s algorithm is designed to take advantage of the higher connectivity in
hypercube computers. Matrices A and B are partitioned into 2 strips by column
and by row, respectively, as follows:
Ba
Ar | Az} As] Aad - c
‘The product matrix C is computed as
a
C= AB
The hypercube is divided into 2* subcubes, each comprising 2°* nodes. The first
step of the algorithm involves the computation of O; = A;Bj, which is carried out
in each subcube using Cannon’s algorithm (Cannon69]. See solution of Problem
8.12 for an example. The communication overhead in this step is'
T= 22042").
”
In the second step, C; in the subcubes are summed together using a “cascade
sum" algorithm (Berntsen90}. This step requires communications among subcubes
with an overhead 7
Tokar ep,
‘The total communication overhead is n(T; +73). Using the relation
2* = n¥/3, the complexity of the communication overhead is proved.
(c) In the Dekel-Nassimi-Sahni algorithm, multiplication is performed on a hypercube
    with 2^{3q} = s^3 nodes. Each node r in the hypercube is identified by a 3-tuple
    (i, j, k), 0 ≤ i, j, k ≤ 2^q - 1 = s - 1. At the beginning, elements A_{jk} and B_{jk}
    are stored in node (0, j, k). At the end, C_{jk} is also stored in node (0, j, k).

    The algorithm consists of three phases. In the first phase, A_{jk} and B_{jk} are
    replicated on nodes (i, j, k), 1 ≤ i ≤ s - 1.

    Here the workload parameter for R_i is omitted since it is assumed to be independent
    of the workload. Based on the definition of isoefficiency, the isospeed condition is
    obtained.

Chapter 4
Processors and Memory Hierarchy
Problem 4.1

(a) The processor design space is a coordinate space with the x and y axes representing
    clock rate and CPI, respectively. Each point in the space corresponds to a de-
    sign choice of a processor whose performance is determined by the values of the
    coordinates.

(b) The time required between issuing two consecutive instructions.

(c) The number of instructions issued per cycle.

(d) The number of cycles required for the execution of a simple instruction, such as
    add, move, etc.

(e) Two or more instructions attempt to use the same functional unit at the same
    time.

(f) A coprocessor is usually attached to a processor and performs special functions at
    high speed. Examples are floating-point and graphics coprocessors.

(g) Registers which are not designated for special usage, as opposed to special-purpose
    registers such as base registers or index registers.

(h) The addressing mode specifies how the effective address of an operand is generated
    so that its actual value can be fetched from the correct memory location.

(i) In the case of a unified cache, both data and instructions are kept in the same
    cache. In split caches, data and instructions are held in separate caches.

(j) Hardwired control: control signals for each instruction are generated by proper
    circuitry such as delay elements. Microcoded control: each instruction is imple-
    mented by a set of microinstructions which are stored in a control memory. The
    decoding of microinstructions generates appropriate signals to control the execu-
    tion of an instruction.
Problem 4.2

(a) Virtual address space is the memory space required by a process during its execu-
    tion to accommodate the variables, buffers, etc., used in the computations.

(b) Physical address space is the set of addresses assigned to the physically available
    memory words.

(c) Address mapping is the process of translating a virtual address to a physical ad-
    dress.

(d) The entirety of a cache is divided into fixed-size entities called blocks. A block is
    the unit of data transfer between main memory and cache.

(e) Multiple levels of page tables are used to translate a virtual page number into a
    page frame number. In this case, some tables actually store pointers to other tables,
    similar to indirect addressing. The objective is to deal with a large memory space
    and facilitate protection.

(f) The hit ratio at level i of the memory hierarchy is the probability that a data item
    is found in M_i.

(g) A page fault is the situation in which a demanded page cannot be found in the
    main memory and has to be brought in from the disk.

(h) A hash function maps an element in a large set to an index in a small set. Usually
    it treats the input element as a number or a sequence of numbers and performs
    arithmetic operations on it to generate the index. A suitable hash function should
    map the input set uniformly onto the output set.

(i) An inverted page table contains entries that record the virtual page number asso-
    ciated with each page frame that has been allocated. This is in contrast to a
    direct-mapping page table.

(j) The strategies used to select the page or pages resident in main memory to be
    replaced when such a need arises.
Problem 4.3

(a) A windowing system divides the register file on a machine into groups which are
    assigned to different procedures. There is usually overlap among the register sets
    to provide a fast communication mechanism among cooperating procedures for
    parameter passing and to allow fast context switching. The use of a large number
    of GPRs allows less frequent memory accesses and speeds up program execution.

(b) A large register file and a large data cache both serve the purpose of reducing
    memory traffic. From an implementation point of view, the same chip area can be
    used for either a large register file or a large data cache. From a programming point
    of view, registers can be manipulated by program code, but cache is transparent
    to the user. In fact, the data cache is primarily involved in load/store operations.
    The addressing of a cache involves address translation and is more complicated
    than that of a register file.

    Reservation stations and reorder buffers are used in superscalar machines to
    facilitate instruction lookahead and internal data forwarding, which are needed to
    schedule multiple instructions through multiple pipelines simultaneously.

(c) In most RISC processors, the integer unit executes load, store, integer, bit, and
    control transfer instructions. It also fetches instructions for the floating-point unit
    in some systems. The floating-point unit performs various arithmetic operations on
    floating-point numbers. The two units can operate concurrently.
Problem 4.4

(a) The comparison is tabulated below:

        Item          | CISC                    | RISC
        --------------+-------------------------+--------------------------------
        Instruction   | 16-64 bits per          | fixed (32-bit) format
        format        | instruction             |
        Addressing    | 12-24                   | limited to 3-5 (mostly
        modes         |                         | register-based, except
                      |                         | load/store)
        CPI           | 2-15, on the average 5  | < 1.5, very close to 1

(b) • Advantages of separate caches:

      1. Double the bandwidth, because two complementary requests can be ser-
         viced at the same time.
      2. Simplify the logic design, as arbitration between instruction and data ac-
         cesses to the cache is simplified or eliminated.
      3. Access time is reduced because data and instructions can be placed close
         to the functional units which will access them. For instance, the instruction
         cache can be placed close to the instruction fetch and decode units.

    • Disadvantages of separate caches:

      1. Complicate the problem of consistency, because data and instructions
         may coexist in the same cache block. This is true if self-modifying code
         is allowed or when data and instructions are intermixed and stored in
         the same cache block. To avoid this would require compiler support to
         ensure that instructions and data are stored in different cache blocks.
      2. May lead to inefficient use of cache memory, because the working-set
         size of a program varies with time and the fraction devoted to data
         and instructions also varies. Hence, the sum of the data cache size and
         instruction cache size is usually larger than the size of a unified cache.
         As a result, the utilization of the instruction cache and/or data cache is
         likely to be lower.

    For separate caches, dedicated data paths are required for both instruction
    and data caches. Separate MMUs and TLBs are also desirable for separate
    caches to shorten the time of address translation. A higher memory band-
    width should be used for separate caches to support the increased demand.
    In actual implementation, there is a tradeoff between the degree of support
    provided and the resulting hardware complexity.
(c) • Instruction issue: a scalar RISC processor issues one instruction per cycle; a
      superscalar RISC can usually issue more than one per cycle.

    • Pipeline architecture: in an m-issue superscalar processor, up to m pipelines
      may be active in any base cycle. A scalar processor is equivalent to a
      superscalar processor with m = 1.

    • Processor performance: an m-issue superscalar processor can have a performance
      m times that of a scalar processor, provided both are driven by the same clock
      rate and no dependence relations or resource conflicts exist among instructions.

(d) Both superscalar and VLIW architectures employ multiple functional units to al-
    low concurrent instruction execution. Superscalar requires more sophisticated
    hardware support, such as large reorder buffers and reservation stations, in order
    to make efficient use of the system resources. Software support is needed to resolve
    data dependences and improve efficiency.

    In VLIW, instructions are compacted by the compiler, which explicitly packs to-
    gether instructions that can be executed concurrently based on heuristics or
    run-time statistics. Because of the explicit specification of parallelism, the hard-
    ware and software support at run time is usually simplified. For instance, the
    decoding logic can be simple.
Problem 4.5  Only a single pipeline is active at a time in a scalar CISC or RISC
architecture, exploiting parallelism at the microinstruction level. The operational
requirement is simple. In a superscalar RISC, multiple pipelines can be active
simultaneously. To do so requires extensive hardware and software support to effectively
exploit instruction parallelism. In a VLIW architecture, multiple pipelines can also be
active at the same time. Sophisticated compilers are needed to compact irregular code
into a long instruction word for concurrent execution.
Problem 4.6

(a) The i486 is a CISC processor. The following diagram shows the general instruction
    format. A few variations also exist for some instructions.

    [Figure: general i486 instruction format — optional prefixes, opcode (one or two
    bytes), mod r/m and s-i-b addressing-mode bytes, displacement (none, 8, 16, or
    32 bits), and immediate (none, 8, 16, or 32 bits).]

    Data formats:
    • Byte (8 bits): 0 to 255
    • Word (16 bits): 0 to 64K
    • DWord (32 bits): 0 to 4G
    • 8-bit integer: ±10^2
    • 16-bit integer: ±10^4
    • 32-bit integer: ±10^9
    • 64-bit integer: ±10^18
    • 8-bit unpacked BCD (1 digit): 0-9
    • 8-bit packed BCD (2 digits): 0-99
    • 80-bit packed BCD (18 digits): ±10^18
    • Single-precision real (24-bit significand): ±10^38
    • Double-precision real (53-bit significand): ±10^308
    • Extended-precision real (64-bit significand): ±10^4932
    • Byte string, word string, dword string, and bit string to support ASCII and
      other data types.
(b) There are 12 different modes whereby the effective address (EA) can be gener-
    ated:

    • register mode
    • immediate mode
    • direct mode: EA ← displacement
    • register indirect or based: EA ← (base register)
    • based with displacement: EA ← (base register) + displacement
    • indexed with displacement: EA ← (index register) + displacement
    • scaled index with displacement: EA ← (index register) x scale + displacement
    • based index: EA ← (base register) + (index register)
    • based scaled index: EA ← (base register) + (index register) x scale
    • based index with displacement: EA ← (base register) + (index register) +
      displacement
    • based scaled index with displacement: EA ← (base register) + (index register)
      x scale + displacement
    • relative: new PC ← PC + displacement (used in conditional jumps, loops,
      and call instructions)
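As a small illustration of the scaled-index modes (the function and numbers below are
hypothetical, not from the i486 manual), the effective address is assembled from base,
index, scale, and displacement:

    def effective_address(base=0, index=0, scale=1, displacement=0):
        """EA = (base register) + (index register) * scale + displacement."""
        assert scale in (1, 2, 4, 8)           # the scale factors the i486 allows
        return base + index * scale + displacement

    # based scaled index with displacement, e.g. element i of an array of 4-byte items:
    print(hex(effective_address(base=0x1000, index=5, scale=4, displacement=0x20)))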
(c) Instruction categories:

    • data transfer: MOV dst, src
    • arithmetic: ADD dst, src
    • logic, shift, and rotate:
          AND dst, src
          SHL dst, count
          ROL dst, count
    • string comparison: CMPS sdst, ssrc
    • bit manipulation: BT dst, bit
    • control transfer: JMP addr
    • high-level language support: LEAVE (procedure exit)
    • protection support: LSL dst, src (load segment limit)
    • floating-point operation: FADD src
    • floating-point control: FINIT (initialize FPU)
(d) HLL support instructions:
BOUND reg, addr    ; check that (reg) lies within the bounds stored at addr
ENTER imm16, imm8  ; make a stack frame of imm16 bytes at nesting level imm8
LEAVE              ; procedure exit
SETcc byte         ; set byte on condition cc, reset byte to 0 otherwise
Assembly directives: commands to the assembler, not executable. For instance,
the following directives define a data segment:
DATA1 SEGMENT
DATA1 ENDS
(e) Interrupt, debugging, and testing features:
• Interrupt: the i486 can handle up to 256 different interrupts, 32 of which are
reserved for Intel; the others can be defined by users. The starting address
of an interrupt service routine is called an interrupt vector. The interrupt
vectors are stored in an interrupt vector table (IVT). When an interrupt
occurs, relevant register values are pushed onto a stack. The interrupt num-
ber is used by the CPU to retrieve the corresponding interrupt vector from the
IVT. After the interrupt service routine is executed, the program resumes
execution of the interrupted instruction.
• Two types of test are available:
1. built-in self-test: tests nonrandom logic realized by PLAs, control ROM,
the TLB, and the on-chip cache
2. external tests: can be performed on the TLB and on-chip cache.
• Three types of on-chip debugging aids are provided:
1. code execution breakpoint opcode: can be inserted at any desired break-
point.
2. single-step capability
3. code and data breakpoint capability provided by the debug registers.
(f) The 80486 allows the execution of 8086 application programs in two modes:
• Real mode: This has the same base architecture as the 8086, but the program
is allowed to access the 32-bit register set of the 80486. The default operand
size is 16 bits. Paging is not allowed in this mode. The maximum memory
size is limited to 2^20 = 1 Mbyte.
• Virtual mode: This mode allows 8086 application programs to take full
advantage of the facilities provided by the i486. In this mode, the i486 can execute any 8086, 80286,
and 80386 software. Also, paging allows more flexible address mapping. The
linear address space available is 2^32 = 4 Gbytes and the virtual address space
is 2^46 bytes.
(g) By setting the PG bit (bit 31) of control register CR0 to 0, paging is disabled. This can
be controlled by software. When paging is disabled, the linear address generated
by the segmentation mechanism is the same as the physical memory address and can
be used directly to access data from memory. Paging is used to cope with the
external fragmentation problem, but it also slows down the system. Applications
which have a stable memory requirement throughout their execution may use this
feature (paging disabled) to improve efficiency.
(h) By selecting a segment size of 4 Gbytes, the entire linear address space becomes
a single segment, which essentially disables the segmentation mechanism. In this
case, segment offset, linear address, and physical address are all identical. Segmen-
tation provides a logical view of the memory space and facilitates protection and
sharing of data. If the system is dedicated to a single application program which
requires a huge memory space, segmentation can be disabled.
(i) Four levels of protection, called privilege levels, are provided:
level 0 (PL = 0): kernel            (most privileged)
level 1 (PL = 1): system services
level 2 (PL = 2): OS extensions
level 3 (PL = 3): applications      (least privileged)
Data stored in a segment with PL = p can be accessed only by code with PL <= p.
A code segment with PL = p can be called only by a task executing with PL <= p.
(j) The Intel i586 has been renamed the Pentium. No detailed information on the processor
is available yet.
Problem 4.7
(a) (1) The general format of an instruction in the i860 is shown below:
[Diagram: 32-bit instruction word divided into an opcode field, two source register fields, a destination register field, and an offset/immediate field.]
There are several variants of this format. Floating-point instructions
also have a similar format, but provide bit fields for specifying the precision of
the operands, pipelining mode, and dual-instruction mode.
Data formats supported include
• Load/store references support 8-, 16-, 32-, 64-, and 128-bit operands.
• Integer operations are performed on 32-bit operands.
• Integer arithmetic operations support 8- and 16-bit operands by sign-
extending the operands to 32 bits.
• Floating-point numbers follow the IEEE 754 standard (see Chapter 6).
• Graphics pixels of 8, 16, and 32 bits are supported. However, re-
gardless of pixel size, the i860 always operates on 64 bits of pixels at a
time.
(2) Four basic addressing modes are supported:
• Offset: absolute address into the first or last 32 Kbytes of the logical
address space.
• Register: operand in a CPU register.
• Register indirect + offset: EA = const + (reg).
• Register indirect + index: EA = (reg1) + (reg2).
(3) Instruction categories:
• Load/store instructions:
ld.x     ; load integer
• Register-to-register move instructions:
ixfr     ; transfer integer to F-P register
• Integer arithmetic instructions:
addu     ; add unsigned
• Shift instructions:
shl      ; shift left
• Logical instructions:
andnot   ; logical AND NOT
• Control-transfer instructions:
intovr   ; software trap on integer overflow
• System control instructions:
flush    ; cache flush
(b) The i860XP provides hardware snooping for cache coherence, whereas in the previous genera-
tion, the i860XR, multiprocessor cache consistency requires software support to avoid caching
shared writable data.
(c) Dual-operation mode refers to the simultaneous execution, under the supervision
of the floating-point control unit, of floating-point operations in the adder and mul-
tiplier. Such operations can be specified by dual-operation instructions such as
Subtract-and-Multiply or Add-and-Multiply.
Dual-instruction mode refers to the capability of the integer unit and floating-
point unit to execute instructions in parallel. Programmers can specify dual-
instruction mode by using assembler directives or by explicitly modifying the op-
code mnemonics.
(d) The i860 has a virtual address space of 2^32 bytes. Translation of virtual address to
physical address is optional and is in effect only when the ATE (Address Translation
Enable) bit in the directory base register is set to 1 by the operating system. The
format of a virtual address is as follows:
    | Dir (bits 31-22) | Page (bits 21-12) | Offset (bits 11-0) |
The address translation mechanism uses the Dir field as an index into a page
directory, which is pointed to by the DTB (directory table base) field of the di-
rectory base register. The Page field is used as an index into the page table
determined by the page directory. The Offset field is used to select a byte within
the page determined by the page table. Each page has a size of 4 Kbytes. This
address translation is illustrated in the following diagram:
[Diagram: two-level address translation through the page directory and a page table, producing the physical page frame plus the Offset.]
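A minimal Python sketch of this two-level walk is given below; the dictionary-based page directory and page table are hypothetical stand-ins for the in-memory structures walked by the hardware, and the 10/10/12-bit field split follows the format above.

    # Sketch of the two-level translation with 4-Kbyte pages; the dictionaries
    # stand in for the page directory and page tables that reside in memory.

    PAGE_SHIFT, PT_BITS, DIR_BITS = 12, 10, 10

    def translate(vaddr, page_directory):
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        page = (vaddr >> PAGE_SHIFT) & ((1 << PT_BITS) - 1)
        dir_index = (vaddr >> (PAGE_SHIFT + PT_BITS)) & ((1 << DIR_BITS) - 1)
        page_table = page_directory[dir_index]   # first-level lookup
        frame_base = page_table[page]            # second-level lookup
        return frame_base + offset

    # Hypothetical mapping: virtual page (Dir=1, Page=2) -> physical frame 0x5000.
    pd = {1: {2: 0x5000}}
    print(hex(translate((1 << 22) | (2 << 12) | 0x1A4, pd)))   # 0x51a4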
Problem 4.8
(a) The allocation of registers is shown in the left-hand side of the following diagram
when the total number of registers N is 40.
(b) If N = 72, the registers can be organized into 4 windows as shown in the right-hand
side of the diagram. Note that in both figures, the eight globally shared registers
are not shown.
(c) The scalability of the SPARC architecture refers to the fact that the number of register windows
can vary with different SPARC implementations.
(d) A calling procedure can pass parameters to a subroutine by writing them into its
OUT registers, which overlap with the IN registers of the subroutine. Likewise, the
results obtained by the subroutine can be passed back by leaving them in its IN
registers, which are the OUT registers of the calling procedure.
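The window counts used in parts (a) and (b) follow directly from the register budget; the small Python helper below (the function name and the assumption of 8 globals plus 16 unique registers per window are ours, matching the figures above) makes the arithmetic explicit.

    # Number of overlapping register windows that fit in N registers: 8 global
    # registers are shared, and each window adds 16 unique registers (8 locals
    # plus 8 that overlap the adjacent window).

    def num_windows(total_registers, num_globals=8, unique_per_window=16):
        return (total_registers - num_globals) // unique_per_window

    print(num_windows(40))    # 2 windows, as in part (a)
    print(num_windows(72))    # 4 windows, as in part (b)
    print(num_windows(136))   # 8 windows, a larger implementation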
Problem 4.9
(a) Two situations may cause pipelines to be underutilized: (i) the instruction latency
is longer than one base cycle, and (ii) the combined cycle time is greater than the
base cycle.
(b) Dependences among instructions or resource conflicts among instructions can pre-
vent simultaneous execution of instructions.
Problem 4.10
(a) Vector instructions perform identical operations on vectors whose length is usually much
larger than 1. Scalar instructions operate on one number or one pair of numbers at a
time.
(b) Suppose the pipeline is composed of k stages and the vector is of length N. The first
output is generated in the k-th cycle. Afterward, an additional output is generated
in each cycle. The last result comes out of the pipeline in cycle (N + k - 1). Using
a base scalar machine, it takes Nk cycles. Thus the speedup is Nk/(N + k - 1).
(c) If m-issue vector processing is employed, each vector is of length N/m. Therefore,
the execution time is (N/m + k - 1) cycles. If only parallel issue is used, the
execution time is (N/m)k cycles. Thus, the speed improvement is
    (N/m)k / (N/m + k - 1) = Nk / (N + m(k - 1)).
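The two expressions are easy to evaluate numerically; the sample values of N, k, and m below are illustrative only and are not taken from the problem statement.

    # Evaluate the two speedup expressions from parts (b) and (c).

    def pipeline_speedup(N, k):
        return N * k / (N + k - 1)

    def m_issue_improvement(N, k, m):
        return (N / m) * k / (N / m + k - 1)

    N, k, m = 64, 6, 4
    print(round(pipeline_speedup(N, k), 2))        # ~5.57
    print(round(m_issue_improvement(N, k, m), 2))  # ~4.57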
Problem 4.11
(a) The average cost is
    c = (c1 s1 + c2 s2) / (s1 + s2).
For c to approach c2, the conditions are s2 >> s1 and c2 s2 >> c1 s1.
(b) The effective access time is
    ta = f1 t1 + f2 t2 = h t1 + (1 - h) t2.
(c) If t2 = r t1, then ta = (h + (1 - h)r) t1 and
    E = t1/ta = 1/(h + (1 - h)r).
(d) The plots of E versus h for several values of r are shown in the following diagram:
[Plot: access efficiency E versus hit ratio h for several values of r.]
(e) If r = 100, we require E = 1/(h + (1 - h) x 100) >= 0.95. Solving the inequality, we
obtain the condition
    h >= 1880/1881 = 99.95%.
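The relation between hit ratio and access efficiency can be explored with a short Python sketch; the function names are ours, and the formulas are exactly those of parts (c) and (e).

    # Access efficiency E = 1/(h + (1 - h) r) and the hit ratio needed to keep
    # E at or above a target when the lower level is r times slower.

    def efficiency(h, r):
        return 1.0 / (h + (1.0 - h) * r)

    def required_hit_ratio(target, r):
        # solve 1/(h + (1 - h) r) >= target for h
        return (r - 1.0 / target) / (r - 1.0)

    print(round(required_hit_ratio(0.95, 100), 5))   # 0.99947, i.e. about 99.95%
    print(round(efficiency(0.99947, 100), 3))        # ~0.95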
Problem 4.12
(a) The average access time is
    ta = h t1 + (1 - h) t2 = h t1 + 10(1 - h) t1 = (10 - 9h) t1.
If h = 0.7, then ta = 3.7 t1 = 74 ns. If h = 0.9, then ta = 1.9 t1 = 38 ns. If h = 0.98,
then ta = 1.18 t1 = 23.6 ns.
(b) The average byte cost is
    c = (c1 s1 + c2 s2)/(s1 + s2) = (20 c2 s1 + c2 x 4000)/(s1 + 4000)
      = (20 x 0.2 s1 + 0.2 x 4000)/(s1 + 4000) = (4 s1 + 800)/(s1 + 4000).
For s1 = 64, 128, and 256, the average cost is 0.26, 0.32, and 0.43, respectively.
(c) For the three design choices, the product of average access time and average cost is
19.24, 12.16, and 10.15, respectively. Therefore, the third option is the best choice.
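The comparison can be reproduced with a short Python sketch; the value t1 = 20 ns and the cost figures are those implied by the computations above.

    # Access time, average per-byte cost, and their product for the three sizes.

    t1 = 20.0   # ns

    def access_time(h):
        return (10 - 9 * h) * t1

    def avg_cost(s1, c1=4.0, c2=0.2, s2=4000.0):
        return (c1 * s1 + c2 * s2) / (s1 + s2)

    for s1, h in [(64, 0.7), (128, 0.9), (256, 0.98)]:
        ta, c = access_time(h), round(avg_cost(s1), 2)
        print(s1, ta, c, round(ta * c, 2))
    # 64: 74.0 x 0.26 = 19.24; 128: 38.0 x 0.32 = 12.16; 256: 23.6 x 0.43 = 10.15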
Problem 4.13 In a system with private virtual memory, processors communicate
with each other through message passing. The latency depends on the interconnection
topology and channel bandwidth. As the system grows larger, latency becomes longer.
There is no data sharing among the processors, so data coherence is not a
problem. Data may migrate from one node to another, but once a datum reaches a
destination node, it becomes private data of that node.
In implementation, message passing is facilitated by a pair of commands (send
and receive) or through remote procedure calls. Since message passing essentially involves
I/O operations, it is much more expensive than local memory accesses. As a result,
applications which can be partitioned into tasks requiring little interaction with each
other are suitable for implementation on such machines.
In a system with a globally shared virtual memory space, the private memory associ-
ated with individual processors forms a uniform address space visible to all processors.
Data can be shared as in a shared-memory multiprocessor. Access latency may vary,
depending on the physical memory location of the data. Some systems allow replication
of data to reduce the latency. Data can be migrated from one processor to another in
pages or other logical units upon demand. If replicated data can be written by multiple
processors, data coherence becomes an issue which needs to be addressed.
Actual implementations differ. Some systems allow only read-only data to be duplicated
[Li89]. Others allow replication of writable data as well. In either case, mechanisms
must be provided to track the location of each data unit (page or object) and enable
fast transportation of data. The complexity can grow rapidly with the system size. In
spite of a globally shared address space, access time still varies with the actual location
of the data. Therefore, applications with good spatial and temporal localities are
suitable candidates. Applications with less regular communication patterns can also be
implemented, although the performance is likely to be degraded. In general, since read
operations are much cheaper than write operations, applications with high read/write
ratios are particularly suited.
Problem 4.14
(a) The inclusion property refers to the property that information present in a lower-level
memory must be a subset of that in a higher-level memory.
(b) The coherence property requires that copies of an information item be identical through-
out the memory hierarchy.
(c) The write-through policy requires that changes made to a data item in a lower-level
memory be made in the next higher-level memory immediately.
(d) The write-back policy postpones the update of the level (i + 1) memory until the item is
replaced or removed from the level i memory.
(e) Paging divides the virtual memory and physical memory into pages of fixed size to
simplify memory management and alleviate the fragmentation problem.
(f) Segmentation divides the virtual address space into variable-sized segments. Each
segment corresponds to a logical unit. The main purpose of segmentation is to
facilitate sharing and protection of information among programs.
Problem 4.15
(a) LRU page replacement gives the following result, with * at the bottom of a column
indicating a page fault:
[Table: contents of the four page frames after each reference in the trace under LRU; the top row repeats the reference stream, and an asterisk below a column marks a page fault.]
In each cycle, the most recently referenced page is brought to the top page
frame. As a result, the top row traces out the original page reference stream. The
hit ratio using LRU is 16/23.
(b) In the circular FIFO scheme, the page frames are organized as a circular queue Q. The
page frames are referenced as Q(i), 0 <= i <= 3. A pointer P is used in conjunction
with the usage bits U(i), 0 <= i <= 3, to decide which page is to be replaced in case
of a page fault. Initially the pointer points to the first free page frame (P = 0)
and the usage bits are all set to 0 (U(0 : 3) = 0). The behavior rules are specified
below:
• Page fault on page J:
    Q(P) = J;  U(P) = 1;
    P = (P + 1) mod 4;
• Page hit on a page resident in page frame I:
    if (P = I) then P = (P + 1) mod 4;
The update of the pointer in the event of a page hit is to avoid replacing the page
immediately in an ensuing page fault.
Based on the initial conditions and behavior rules, it is easy to write a program
to trace the contents of the page frames in response to the reference stream; a sketch is given after the table below. In the
following table, we show the evolution of the arrays and the pointer. An asterisk
(*) at the end of a row indicates that a page fault has occurred for that particular
page reference.
[Table: for each page reference, the contents of frames Q(0)-Q(3), the usage bits U(0)-U(3), and the pointer P; rows marked * correspond to page faults.]
For this particular reference stream, the hit ratio is 16/33, which is the
same as that for the LRU scheme. However, the contents of the page frames are
somewhat different for the two schemes.
Note that different behavior rules have been proposed in the literature, which may
give rise to slightly different results.
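As noted above, the behavior rules are easy to turn into a trace program. The sketch below follows the rules as stated (including the pointer update on a hit); the short reference stream at the bottom is only a stand-in for the trace given in the problem statement.

    # Trace the circular-FIFO rules stated above: four frames, a pointer P,
    # and usage bits U.

    def circular_fifo(trace, nframes=4):
        Q = [None] * nframes     # page frames
        U = [0] * nframes        # usage bits
        P = 0                    # replacement pointer
        hits = 0
        for page in trace:
            if page in Q:                        # page hit
                hits += 1
                i = Q.index(page)
                U[i] = 1
                if P == i:                       # avoid replacing the page just hit
                    P = (P + 1) % nframes
            else:                                # page fault
                Q[P], U[P] = page, 1
                P = (P + 1) % nframes
            print(page, Q, U, P)
        return hits / len(trace)

    print(circular_fifo([1, 0, 2, 2, 1, 7, 6, 7, 0, 1, 2, 0, 3]))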
Problem 4.16
(a) Temporal locality refers to the property that recently used data or instructions are
likely to be reused in the near future. Spatial locality refers to the property that a
process tends to access data or instructions stored in consecutive locations. Sequen-
tial locality refers to the observation that the execution order of instructions
tends to follow the sequential program order.
(b) The working set is the subset of addresses or pages referenced within a given time
window or a given number of most recent references. It approximates the program
locality property. Pages in the working set are considered actively used and should
reside in main memory. If the window size is large, the resident pages may encom-
pass several locality regions and the size of the working set is likely to grow. A large
window size should improve the hit ratio. But in a multiprogrammed environment,
keeping a large number of pages in memory for each process will exhaust the
page frames and cause thrashing. On the other hand, a small window size gives
rise to a small working set and may lower the hit ratio because the actively used pages
shift with time.
(c) 90-10 rule: It has been empirically observed that 90% of the execution time is
spent on approximately 10% of the code. The rule reflects program locality, as a
short segment of code gets executed repeatedly and the data accessed tend to be
contiguous elements of a large array structure.
Problem 4.17
(a) The effective access time is
    teff = h1 t1 + (1 - h1) t2 = 0.95 t1 + 0.05 t2.
(b) The total cost is c = c1 s1 + c2 s2.
(c) 1. We have the following inequality:
    0.01 x 512 x 1024 + 0.0005 x s2 <= 15,000.
Therefore s2 cannot exceed 18.6 Mbytes.
2. The following inequality is obtained:
    20 x 0.95 + 0.05 x t2 <= 40.
Hence, t2 <= 420 ns.
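The two constraints in part (c) can be checked with a few lines of Python; the budget, cost, and timing figures are those used in the inequalities above.

    # Check the design constraints: memory size within the budget and
    # effective access time within the 40-ns target (t1 = 20 ns, h1 = 0.95).

    def max_s2(budget=15_000, c1=0.01, s1=512 * 1024, c2=0.0005):
        return (budget - c1 * s1) / c2          # largest affordable s2, in bytes

    def max_t2(target=40.0, t1=20.0, h1=0.95):
        return (target - h1 * t1) / (1 - h1)    # slowest allowable t2, in ns

    print(round(max_s2() / 2**20, 1))   # ~18.6 Mbytes
    print(max_t2())                     # 420.0 ns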
Problem 4.18
Attribute: Data objects
  Symbolic processing: lists, relational databases, scripts, semantic nets, frames,
  blackboards, objects, production systems.
  Numeric processing: integer and floating-point numbers, vectors, matrices.
Attribute: Common operations
  Symbolic processing: search, sort, pattern matching, filtering, contexts, partitions,
  transitive closures, unification, text retrieval, set operations, reasoning.
  Numeric processing: add, subtract, multiply, divide, matrix multiplication,
  matrix-vector multiplication, reduction operations such as the dot product of vectors, etc.
Attribute: Memory requirements
  Symbolic processing: large memory with an intensive access pattern; addressing is
  often content-based; locality of reference may not hold.
  Numeric processing: large memory demand with intensive access; the access pattern
  usually exhibits a high degree of spatial and temporal locality.
Attribute: Communication patterns
  Symbolic processing: message traffic varies in size and destination; the granularity
  and format of message units change with the application.
  Numeric processing: message traffic and granularity are relatively uniform; proper
  mapping can restrict communication largely to neighboring processors.
Attribute: Algorithm properties
  Symbolic processing: nondeterministic, possibly parallel and distributed
  computations; data dependences may be global and irregular in pattern and granularity.
  Numeric processing: typically deterministic; amenable to parallel and distributed
  computations; data dependence is mostly local and regular.
Attribute: Input/output requirements
  Symbolic processing: input can be graphical and audio as well as from the keyboard;
  access to very large online databases.
  Numeric processing: large data sets usually exceed the memory capacity; fast I/O is
  highly desirable.
Attribute: Architecture features
  Symbolic processing: parallel update of the knowledge base, dynamic load
  balancing, dynamic memory allocation, hardware-supported garbage collection,
  stack processor architecture, symbolic processors.
  Numeric processing: can be implemented with vector, MIMD, or SIMD
  processors using various memory and interconnection structures; systolic
  arrays are suitable for certain types of computations.
Chapter 5
Bus, Cache, and Shared Memory
Problem 5.1
(a) The maximum bus bandwidth is 8 x 20 = 160 Mbytes/s.
(b) Memory access time is defined as the time from when a memory request is received by the
memory unit to the time at which all the requested information has been made
available at the memory output terminals (Hayes, 1988).
At first, it takes 50 ns (1 bus cycle) for the address to be transmitted to the
memory module. After the data is ready on the memory output port, it takes 50
ns to transfer one word to the processor. In the worst case, the four words are
accessed one by one separately. The total amount of time for a processor to access
one word from the memory is
    50 ns + 100 ns + 50 ns = 200 ns,
during which the bus cannot be used by other processors. Thus, the effective
bandwidth is
    8 bytes / 200 ns = 40 Mbytes/s,
which is one-fourth of the maximum bus bandwidth.
If the memory addresses are interleaved, access of the four words can
be performed simultaneously. It takes 50 ns to transmit the address to the memory
modules and 100 ns to get the data ready in the latches. Then it takes four bus cycles
to transfer the four words to the requesting processor. Therefore, the total time
required is 50 + 100 + 4 x 50 = 350 ns. Thus, the effective bus bandwidth is
    4 x 8 bytes / 350 ns = 91.4 Mbytes/s.
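The two bandwidth figures can be reproduced with a short Python sketch; the constants follow the timing assumptions above.

    # Effective bus bandwidth for the two cases in part (b): four separate
    # single-word accesses versus one interleaved four-word access.

    BUS_CYCLE = 50e-9      # s, one bus cycle
    MEM_ACCESS = 100e-9    # s, memory access time
    WORD = 8               # bytes

    def separate():
        t = 4 * (BUS_CYCLE + MEM_ACCESS + BUS_CYCLE)   # address, access, transfer
        return 4 * WORD / t

    def interleaved():
        t = BUS_CYCLE + MEM_ACCESS + 4 * BUS_CYCLE     # one address, four transfers
        return 4 * WORD / t

    print(separate() / 1e6, interleaved() / 1e6)       # 40.0 and ~91.4 Mbytes/s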
(c) Any of the arbitration schemes discussed in the text can be used. The decision is
based on the desired performance and circuit complexity.
(d) 40 address lines and 64 data lines are needed. In order to limit the total number
of signal lines to 104, the address lines can serve as low-order data lines by use
of multiplexers. The other lines carry control signals such as bus request, bus
grant, reset, data sync, address sync, data ack, arbitration, read/write, etc. For
a description of the functionality of each signal line, consult the specification of
standard buses.
(e) At least 21 slots are needed, one for each processor board, one for each memory
board, and one for the bus controller.
Problem 5.2 In a daisy-chained arbitration scheme, there is only one central arbiter.
One bus request line is connected to all processors. A single bus grant line is connected
to the processors in a daisy-chain manner, which means that a processor will acquire the
bus only if none of the processors closer to the arbiter requests to use the bus. As a
result, the scheme works with a fixed priority based on the proximity of the processors to
the arbiter.
The advantage is its simplicity of installation; additional processors can be added
to an existing chain by sharing the same set of arbitration lines. The simplicity also
makes it feasible to install more than one set of request and grant lines to improve
system reliability.
The disadvantage is the violation of the fairness principle by the fixed priority assignment.
Also, it takes a long time for the bus-grant signal to propagate along the chain. As a
result, the number of processors that can be effectively supported is small.
If a distributed arbiter scheme is used, each processor has its own arbiter, to which a
unique arbitration number (AN) is assigned. When two or more processors request
to use the bus simultaneously, the arbiters bid by sending their ANs to a shared bus re-
quest/grant (SBRG) line whose logic selects the maximum among the ANs and leaves
it on the line. Subsequently, each arbiter compares its AN with that on the SBRG in
parallel. Only the request from the arbiter whose AN matches that on the SBRG will be
sustained. After the present transaction is finished, the selected processor will seize the
bus.
The advantage of using distributed arbiters is the flexibility of implementing various
priority schemes and the fast arbitration time.
The disadvantage lies in the complex arbitration structure, which increases the
implementation cost.
Problem 5.3
(a) Assume low-order interleaving is adopted in the organization of the memory mod-
ules. Further assume that requests to all memory modules are equally likely. There-
fore, the probability of a request by processor Pi to any memory module Mj is p/m,
independent of i and j. The probability of no request to Mj from Pi is 1 - p/m.
Hence, the probability of no request to Mj from any of the processors is (1 - p/m)^n, and
the probability of at least one request is 1 - (1 - p/m)^n. If
b >= m, all the requests can be satisfied by the bus system. Therefore, the memory
bandwidth is estimated as follows:
    BW = m (1 - (1 - p/m)^n).
When n is large,
    (1 - p/m)^n = e^(-np/m) (approximately).
The memory bandwidth is thus m(1 - e^(-np/m)).
(b) The memory bandwidth BW_b is the expected number of busy memory modules, or suc-
cessful memory accesses, in a multibus system with b buses; np is the expected
number of memory requests generated by the processors. In general, not all mem-
ory requests can be satisfied because of conflicts arising from (i) more than one
request being made to the same memory module, and (ii) the inability of the available bus
capacity to accommodate all the requests. The presence of conflicts means that not
all the expected memory access requests can be successful. Therefore, BW_b <= np.
However, as the authors showed in the paper, through proper choice of the design
parameters, most of the memory requests from the processors can be satisfied, i.e.,
BW_b/np -> 1.
Problem 5.4 For this problem, it is assumed that each cache miss (read or write)
leads to the replacement of a block, which can be occupied or empty, in the cache to
make room for the missing block.
(a) Write-through scheme:
[Access tree: cache hit (0.95) - write (0.5) 400 ns, read (0.5) 20 ns; cache miss (0.05) - write (0.5) (400 + 400) ns, read (0.5) (400 + 20) ns.]
Effective memory access time:
    t_WT = 0.95 x (0.5 x 400 + 0.5 x 20) + 0.05 x (0.5 x (400 + 400) + 0.5 x (400 + 20))
         = 0.95 x 210 + 0.05 x 610
         = 230 (ns).
(b) Write-back scheme:
[Access tree: cache hit (0.95) - read (0.5) 20 ns, write (0.5) 60 ns; cache miss (0.05) - read (0.5): clean (0.9) (400 + 20) ns, dirty (0.1) (400 + 400 + 20) ns; write (0.5): clean (0.9) (400 + 60) ns, dirty (0.1) (400 + 400 + 60) ns.]
Effective memory access time:
    t_WB = 0.95 x (0.5 x 20 + 0.5 x 60) + 0.05 x (0.5 x
           (0.9 x (400 + 20) + 0.1 x (400 + 400 + 20)) + 0.5 x
           (0.9 x (400 + 60) + 0.1 x (400 + 400 + 60)))
         = 0.95 x 40 + 0.05 x 480
         = 62 (ns).
(c) The memory access time per instruction is
    0.2 x 230 = 46 ns for write-through,
    0.2 x 62 = 12.4 ns for write-back.
Therefore, the effective execution time per instruction is
    0.1 us + 46 ns = 0.146 us for write-through,
    0.1 us + 12.4 ns = 0.1124 us for write-back.
The effective MIPS rate for each processor is
    1/0.146 = 6.85 for write-through,
    1/0.1124 = 8.90 for write-back.
The upper bound of the MIPS rate for the multiprocessor system is
    16 x 6.85 = 109.6 for write-through,
    16 x 8.90 = 142.3 for write-back.
The above upper bounds are obtained by considering only the memory access
time. In fact, it is difficult to achieve the upper bounds since the processors may not
be fully utilized due to data dependences or resource conflicts among instructions.
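The access times and MIPS rates above can be reproduced with the following Python sketch; the figure of 0.2 memory accesses per instruction is the value implied by part (c), and the remaining constants are those used in the access trees.

    HIT, MISS = 0.95, 0.05
    MEM, C_READ, C_WRITE, DIRTY = 400, 20, 60, 0.1   # ns; DIRTY = dirty-block fraction

    t_wt = (HIT * (0.5 * MEM + 0.5 * C_READ)
            + MISS * (0.5 * (MEM + MEM) + 0.5 * (MEM + C_READ)))

    t_wb = (HIT * (0.5 * C_READ + 0.5 * C_WRITE)
            + MISS * (0.5 * ((1 - DIRTY) * (MEM + C_READ) + DIRTY * (2 * MEM + C_READ))
                      + 0.5 * ((1 - DIRTY) * (MEM + C_WRITE) + DIRTY * (2 * MEM + C_WRITE))))

    for name, t in [("write-through", t_wt), ("write-back", t_wb)]:
        per_instr = 0.1 + 0.2 * t / 1000      # us: 0.1 us CPU + 0.2 accesses/instr
        print(name, t, round(1 / per_instr, 2), "MIPS per processor")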
Problem 5.5
(a) Low-order interleaving refers to an organization of the memory in which the least
significant (low-order) bits of the memory address are used to select the memory mod-
ule and the rest (the high-order bits) indicate the address of the word within the selected
module.
(b) When data blocks in a cache are tagged and indexed by physical memory address,
it is a physical address cache. In contrast, a virtual address cache does not wait for
the physical address to be generated and is accessed by virtual memory address.
It offers improved efficiency by overlapping cache access with physical address
translation. The disadvantage is the potential aliasing problem, which entails frequent
cache flushing.
(c) In a shared memory, if an update to a memory location is observed by all processors
at the same time, then the memory access is atomic. If the update is not necessarily
observed by all processors simultaneously, the memory access is nonatomic.
(d) Memory bandwidth is the maximum rate at which data can be transferred to or
from the memory. It is determined by the memory cycle time, bus width, and memory
organization. The effective data transfer rate between memory and processors may be
lower due to conflicts. Fault tolerance of the memory system is the capability to
continue operation with a lower bandwidth when one or more memory modules
fail.
Problem 5.6
(a) In a write-through cache, an update to a cache block causes the corresponding
memory block to be updated immediately. In a write-back cache, the update of
the memory block is postponed until the cache block is replaced.
(b) Data which are globally shared among several processors and whose values may be
updated can be tagged as noncacheable. Instructions, private data, and globally
shared read-only data are tagged as cacheable. This distinction is an alternative
approach used to avoid the cache inconsistency problem.
(c) Private caches are those attached to individual processors; shared caches are shared
among processors, much like shared memory modules. The two types of caches
can coexist in a system. For example, a shared cache can be used as the second-level
cache in a multilevel cache system.
(d) Cache flushing is used to deal with the aliasing problem in a virtual address cache.
Cache flushing policies determine when flushing should be performed and the level
at which flushing takes place (page, segment, context, etc.). Those policies are
closely related to operating system design.
(e) Cache hit ratio is affected by factors including cache capacity and block size. A
large cache improves the hit ratio. For a fixed cache size, there is an optimal block
size at which the hit ratio peaks. A small block size does not take full advantage of
locality properties. A large block size, on the other hand, may load unneeded data
into the cache. In a set-associative cache organization, the number of sets and the set
size can also affect the hit ratio.
Problem 5.7
(a) In order to preserve the individual program orders, the first statement executed must be
a, c, or e. Consider the case where a is executed first. A tree can be constructed,
each branch of which traces out an execution sequence that preserves the individual
program orders. The tree in the following diagram shows the interleavings of
instructions and the corresponding output for each interleaving.
[Diagram: tree of all instruction interleavings that begin with a and preserve the individual program orders, with the six-bit output produced by each interleaving.]
(b) The 720 different execution orders cannot generate all the different combinations.
For example, the combination 001100 cannot be generated by any execution order.
The 11 pair at the center of the output requires that two of the assignment state-
ments (a, c, and e) be executed before the second output statement. Hence at least
two of the three variables already have the value 1 before the last output statement is
executed, rendering it impossible to generate the last pair of 0s.
In fact, out of the 64 possible combinations, only 50 can be generated by the
six statements executed in arbitrary order. Many different execution orders generate
identical output sequences, as can be seen in (a).
(c) The sequence 011001 can only be generated by either of the following two execution
orders: cfbeda and edafbc. Note, however, that neither of the two execution orders
preserves the individual program orders. Therefore, if the individual program orders have
to be preserved, then the sequence cannot be generated. For a more formal proof,
refer to [Dubois88].
(d) Take as an example the sequence 001100, which cannot be generated if memory
accesses are atomic. Suppose each processor executes sequentially, but the changes
to variable values are not immediately observed by all the other processors. Con-
sider the order of execution abcedf, which does not violate the program order of any
individual program. First, the pair 00 is generated by b. Then d will produce the
pair 11, provided processor 2 has observed the changes made by processors 1 and
3. Finally, processor 3, which has not observed the changes to A and B by the
other processors, executes f and prints out 00.
Problem 5.8 The main memory blocks are numbered 0 to 63, and the cache block frames
are numbered 0 to 63. The mappings are shown in (a) through (d). In each case, the
address format and cache tag are also shown.
(a) Direct mapping:
(b) Fully associative mapping:
(c) Set-associative mapping:
(d) Sector mapping:
Problem 5.9
(a) Each set of the cache consists of 256/8 = 32 block frames, and the entire cache has
16 x 1024/256 = 64 sets. Similarly, the memory contains 1024 x 1024/8 = 131072
blocks. Thus, the memory address format is as shown in the following figure:
    | Cache address tag (11 bits) | Set address (6 bits) | Word address (3 bits) |
A block B of the main memory is mapped to a block frame in set F of the
cache if F = B mod 64.
(b) The effective memory access time for this memory hierarchy is 50 x 0.95 + 400 x
(1 - 0.95) = 47.5 + 20 = 67.5 ns.
Problem 5.10
(a) The address assignment is shown in the following diagram:
[Diagram: the 1024 words (10-bit memory address) are assigned to the four modules M0-M3 in low-order interleaved fashion, i.e., word i resides in module i mod 4 (0, 1, 2, 3 in the first row; 4, 5, 6, 7 in the second; ...; 1020, 1021, 1022, 1023 in the last).]
(b) There are 1024/16 = 64 blocks in the main memory, and 256/16 = 16 block
frames in the cache.
(c) 10 bits are needed to address each word in the main memory: 2 for selecting the
memory module and 8 for the offset of a word within the module. 6 bits are
required to select a word in the cache: 2 bits to select the set number and 4 bits
to select a word within a block. In addition, each block frame needs a 4-bit address
tag to identify the block resident in it.
(d) The mapping of memory blocks to the block frames in the cache is shown in the
following diagram:
[Diagram: main memory blocks, each identified by a 4-bit tag, mapped onto the sets of the cache.]
After the set into which a memory block can be mapped is identified, the
address tags of the block frames in that set are compared by associative search with
the physical memory address to determine whether the desired block is in the cache.
Problem 5.11
(a) Based on the given data, the following access tree is obtained:
[Access tree omitted.]
We have the following expression for the average access time:
    ta = fi (h c + (1 - h)(b + c)) + (1 - fi)(h c + (1 - h)((b + c)(1 - fd) + (2b + c) fd)).
(b) If the extra time taken by invalidation propagation is taken into account, the
average access time is
    ta' = ta + (1 - fi) finv tinv.
Problem 5.12
(a) The cache organization and the relation between physical address and cache address
are shown in the following diagrams:
[Diagram: physical address divided into a cache address tag, a set address, and a byte address within the block.]
(b) (0000104F)16 = (0000 0000 0000 0000 0001 0000 0100 1111)2. From the address map-
ping shown in the above diagram, it is clear that the address can be assigned to
any block frame in set 1.
(c) In order for the address (FFFF7Axy)16 to be mapped to the same set as (0000104F)16,
the least significant bit of x must be 0 and the most significant bit of y must be 1. The
other bits can be either 0 or 1. Therefore, x can be any of the hexadecimal digits
{0, 2, 4, 6, 8, A, C, E}, and y can be any element of {8, 9, A, B, C, D, E, F}.
Problem 5.13
(a) The effective CPI for each processor can be computed as
    CPI = 1 + m t r,
where m is the number of memory accesses per instruction, t the memory access time,
and r the clock rate (so that t r is the number of cycles per memory access).
Therefore, the total MIPS rate of a system with p processors is
    MIPS = p r / (1 + m t r).
(b) Using the expression derived in (a), we obtain the following equation:
    32 r / (1 + 0.4 r) = 56.
The equation is solved to give r = 35/6 = 5.83 MHz.
(c) Substituting the given performance data into the equation in (a), the following
MIPS rate is obtained:
    MIPS = 32 x 2 / (1 + 1.6 x 1 x 2) = 64/4.2 = 15.24 MIPS.
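Under the interpretation used above (r in MHz, t in microseconds), the computations in (b) and (c) can be checked with a few lines of Python; the helper name is ours.

    # System MIPS rate from part (a): MIPS = p*r / (1 + m*t*r).

    def system_mips(p, r, m, t):
        return p * r / (1 + m * t * r)

    print(round(system_mips(32, 2, 1.6, 1), 2))   # part (c): 15.24 MIPS

    # Part (b): clock rate giving 56 MIPS when p = 32 and m*t = 0.4:
    # 32r/(1 + 0.4r) = 56  =>  r = 56/(32 - 56*0.4) = 35/6 ~ 5.83 MHz
    print(round(56 / (32 - 56 * 0.4), 2))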
Problem 5.14
(a) The effective access time added by cache and memory misses for each memory access is
    ta = f1 (1 - h1) t1 + f2 (1 - h2) t2.
The average execution time per instruction is obtained by adding the memory penalty
m ta to the other per-instruction time components given in the problem, and the
effective MIPS rate of the entire system is
    MIPS = p / (average time per instruction, in us).
(b) Using the data given, we have the following values:
    ta = 0.5 x (1 - 0.95) x 0.5 + 0.5 x (1 - 0.7) x 0.5 = 0.0875,
    average time per instruction = 0.4 x 0.0875 + 0.2 + 0.05 x 5 = 0.485 (us).
Finally, p/0.485 must meet the required system MIPS rate; hence, the number of
processors needed is p = 13.
(c) The cost of the caches is 4.7 x 16 x (32 + 64) = 7219.2. Hence, the total amount
of money allowed for the shared memory is 17781.8, and the memory capacity in
Mbytes is
    Cm = 17781.8 / (cost per Mbyte of the shared memory).
Problem 5.15
(a) The address formats are shown in the following diagrams for the different design
choices:
[Diagrams: address field layouts (module number and word offset) for Designs 1, 2, and 3.]
(b) In case one memory module fails, the memory bandwidth is as follows:
• Design 1: 0.
• Design 2: 8 words per access.
• Design 3: 12 words per access.
(c) In a fault-free situation, Design 1 offers the highest memory bandwidth in the case
of vector access. But the entire memory system can be crippled by a single memory
module failure. The other two designs offer more graceful degradation in case of
module failures, although their bandwidth is not as high as that of Design 1 under fault-free
conditions.
Problem 5.16
(a) All strides except multiples of 17: 80M words per second; strides of multiples of
17: 20M words per second.
(b) All strides except multiples of 4: 80M words per second; strides of multiples of 8:
20M words per second; strides of multiples of 4 but not 8: 40M words per second.
Problem 5.17
(a) Using the formula in Eq. 5.1 of Problem 5.7, there are 20 execution interleaving
orders that preserve the individual program orders. Trees similar to that given in
Problem 5.7 can be constructed. The possible interleaving orders are: abcdef,
abdcef, abdecf, abdefc, adbcef, adbecf, adbefc, adebcf, adebfc, adefbc, dabcef,
dabecf, dabefc, daebcf, daebfc, daefbc, deabcf, deabfc, deafbc, defabc.
(b) If program order is preserved and atomic memory accesses are assumed, the fol-
lowing 4-tuple output combinations can be obtained: 0111, 1011, and 1111.
(c) Suppose program order is preserved and nonatomic memory accesses are assumed.
Then before c is executed, A has been set to 1 by a. Similarly, C is set to 1
before it is printed by f. Because of the nonatomic memory accesses, the value of D
in c and that of B in f are uncertain. Therefore the output can be 1xx1 or x11x,
depending on the instruction interleaving. Here the don't-care bit x can be either
0 or 1. The possible combinations are 1001, 1011, 1101, 1111, 0110, 0111, and 1110.
(Pattern 1111 appears in both cases and is shown only once.)
Problem 5.18
(a) Hardware complexity and implementation cost are reflected in the mechanism to de-
termine whether a given block is in the cache after the block address has been decided.
Direct mapping has the lowest cost, since a simple modulus operation is sufficient.
Fully associative mapping has the highest cost, since an associative search over all
block frames is needed. The relative cost of set-associative and sector mapping
depends on the implementation. In set-associative mapping, an associative search is
needed within each set; in sector mapping, it is needed to determine the sector.
For a fixed cache size, the size of each set or sector will make a difference in cost.
(b) In direct mapping, block replacement is rigid and trivial. The other schemes allow
similar flexibility in the design of replacement algorithms. For instance, all the
replacement algorithms discussed in the text can be implemented with any of the
three mappings. In the case of fully associative mapping, the algorithms are applied
to the entire cache. In set-associative or sector mapping, only a subset of the cache
block frames is examined in the application of the replacement algorithms.
(c) Effects of the block mapping policy on the hit ratio:
• Direct mapping: The hit ratio is strongly affected by the reference pattern. If
the reference pattern leads to a uniform distribution of the working set in the
cache, the hit ratio will be high. But if two or more blocks mapped to the same
block frame are referenced alternately, the hit ratio will drop sharply.
• Fully associative mapping: The hit ratio is essentially independent of the refer-
ence pattern. The hit ratio should be high except in the rare case of an anomalous
lack of locality in the references.
• Set-associative mapping: On the average, the hit ratio should be higher than with
direct mapping and lower than with fully associative mapping. Thrashing is still
possible, but with a lower probability than with direct mapping.
• Sector mapping: The hit ratio is sensitive to the reference pattern. Because
of the mapping scheme adopted, when a block in a sector is replaced, the
other blocks in the same sector are invalidated, which effectively reduces
the number of valid blocks resident in the cache. This is likely to have an
adverse effect on the hit ratio.
(d)
• For the effect of block size on the cycle count and hit ratio, see the discussion
on pages 236-238 and Fig. 5.14 in the text.
• Set number and associativity: For a fixed cache size, the two parameters
are inversely proportional to each other. When the number of sets is small,
the cache behaves more like a fully associative cache. When the number of sets
is large, its behavior is close to that of direct mapping and the hit ratio
is expected to become lower. The actual performance depends on the
characteristics of the application programs.
• Cache size: With a larger cache, more data and instructions can be held in
the cache, which improves both the hit ratio and the cycle count.
Problem 5.19
(a) A memory manager performs several functions:
• It keeps track of the memory space being used by individual processes and
their IDs.
• It determines which processes are to be loaded into memory when memory space
is freed.
• It allocates and deallocates memory space as needed.
(b) Suppose a new block needs to be brought into memory. In nonpreemptive alloca-
tion, the incoming block can only be placed in a free memory block. In a preemptive
allocation scheme, the incoming block is allowed to be placed in a block currently
occupied by another process. A nonpreemptive scheme is easier to implement, but a
preemptive scheme can make better use of the memory space.
(c) In a swapping system, an entire process (instructions and data) is swapped between
main memory and disk. In other words, a process is either resident in memory or
forced out of it in its entirety. Examples are the PDP-11 and early UNIX systems.
(d) In a demand paging system, individual pages rather than entire processes can be
swapped between main memory and disks independently. A page is brought into
memory only when it is demanded. Demand paging has been implemented in
recent releases of the UNIX system.
(e) Hybrid memory systems use a combination of swapping and demand paging in
managing the memory system. Examples include VAX/VMS and UNIX System
V.
Problem 5.20
(a) Lamport’s definition of sequeritial consistency (SC) gets rid of the concept of 2
global clock and relies solely on the ordering of events. The concepts of program76
(b)
(e)
Bus, Cache, and Shared Memory
order and memory order form the foundation of various memory consistency models,
developed subsequently. The conditions given by Dubois et al. are sufficient but
not necessary conditions for implementing SC. Their definition is centered around
the abstract notion of memory operations in one processor being "performed with
respect to other processors". Sindhu et al.'s definition is more formal. A set of
axioms based on the mathematical notions of total and partial ordering is used to
rigorously specify the behavior of memory systems that satisfy SC. It also defines
an atomic swap operation for the implementation of test-and-set, which is used
to guarantee mutually exclusive entry into critical sections. The similarity among
the three SC models is the total ordering of memory events and the obedience of
program order within each processor.
(b) The DSB model of weak consistency (WC) imposes SC on synchronization opera-
tions only. Other store and load operations are allowed to proceed without waiting
for the completion of one another as required by the SC model. This allows a higher
degree of parallelism to be realized. The TSO model imposes program order only on
store-store (write after write) operations. Load operations do not have to be visi-
ble to the shared memory provided they can be satisfied by a corresponding store
operation in the write buffer. The TSO model also allows a load operation to bypass
write operations.
(c) PSO is derived from TSO by distinguishing among the store operations performed by an
individual processor. In TSO, all the store operations in a processor have to be
carried out in program order. But in PSO, only two types of store operations need
to be performed in program order: (1) store operations explicitly separated by a
store barrier (Stbar) in the program; (2) store operations performed to the same
memory location. In other words, stores which are to different memory locations
and are not separated by Stbars are allowed to be executed out of program order.
As such, the write buffer in each processor is no longer a FIFO queue. This is
similar to the DSB weak consistency model and is likely to increase parallelism. The
drawback is that the programmer has to determine where strict program order has
to be followed and insert Stbars in those places.
Chapter 6
Pipelining and Superscalar Techniques
Problem 6.1
(a)
    Speedup = nk / [k + (n - 1)] = (15000 x 5) / (5 + (15000 - 1)) = 75000/15004 = 4.9986.
(b)
    Efficiency = n / [k + (n - 1)] = 15000/15004 = 0.9997.
    Throughput = nf / [k + (n - 1)] = 15000 x 25 x 10^6 (instructions/s) / 15004 = 24.99 MIPS.
Problem 6.2
(a) The clock frequency of the DEC Alpha is 150 MHz. Comparing it with the 25 MHz of
a base machine, the superpipeline degree is 6. Alpha issues two instructions every
cycle; therefore, its superscalar degree is 2.
(b) Alpha has a huge virtual address space: virtual addresses are 64 bits long. Alpha
provides instructions for synchronization and cache coherence. This makes it suit-
able for building multiprocessor systems. However, the scalability of multiprocessor
systems is lower than that of multicomputer systems.
Problem 6.3
(a) The superpipelined structure has extra startup overhead and a higher branch penalty.
See the original paper by Jouppi and Wall (1989).
(b) Under steady state, a superpipelined machine of degree n and a superscalar ma-
chine of degree n can both execute n instructions simultaneously. The super-
pipelined machine outputs one result every 1/n clock cycle, while the superscalar
machine outputs n results every clock cycle.
Problem 6.4 The performance/cost ratio can be expressed as
    PCR = 1 / [(t/k + d)(c + kh)].
Maximizing PCR is the same as minimizing its inverse. Let k0 be the optimal number
of pipeline stages. Then
    d/dk (1/PCR) at k = k0 is 0,
whence
    -(t/k0^2)(c + k0 h) + (t/k0 + d) h = 0.
After some simplification, we get
    k0 = sqrt(tc / (dh)).
Note that
    d^2/dk^2 (1/PCR) at k = k0 is > 0.
Therefore, at k0, 1/PCR is a minimum and PCR is a maximum.
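A short numeric check of this result: the sketch below computes k0 and searches the integer stage counts for the one that maximizes PCR; the parameter values are illustrative only.

    # Optimal number of pipeline stages k0 = sqrt(t*c/(d*h)), with a numeric
    # check that it maximizes PCR = 1 / ((t/k + d) * (c + k*h)).

    import math

    def pcr(k, t, c, d, h):
        return 1.0 / ((t / k + d) * (c + k * h))

    def k_opt(t, c, d, h):
        return math.sqrt(t * c / (d * h))

    t, c, d, h = 120.0, 50.0, 5.0, 10.0      # illustrative values only
    best_int_k = max(range(1, 30), key=lambda k: pcr(k, t, c, d, h))
    print(round(k_opt(t, c, d, h), 2), best_int_k)   # ~10.95 and 11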
Problem 6.5 Lower bound of MAL = the maximum number of checkmarks in any
row of the reservation table. Upper bound of MAL = the number of 1's in the initial
collision vector plus 1. A detailed proof can be found in the paper by Shar (1972).
Problem 6.6
(a) Forbidden latencies: 1, 2, and 5. Initial collision vector: (10011).
(b) State transition diagram:
[State transition diagram omitted.]
(c) MAL = 3.
(d) Throughput = 1/(3 x 20 ns) = 16.67 million operations per second (MOPS).
(e) Lower bound of MAL = 2. The optimal latency is not achieved.
Problem 6.7
(a) Reservation table:
(b) State transition diagram:
{c) Simple cycles: (4), (5), (7), (81); (3,4), (8,5,4), (3.5.7), (1,7), (6,4); (5.7), G7),
(2,84), (2,3,5,4), (1,8,5,7}, (13,7), (1943), (Lets4), (14,7), (5:3,4), (5,3,7), (5,3,1,7)
Greedy cycle: (1,3)
(d)
    MAL = (1 + 3)/2 = 2.
(e)
    Throughput = 1/(2t), where t is the clock period.
Problem 6.8
(a) We can complete the computation in N + 11 clock cycles by the following sequence:
• cycle 1: Compute A1 + 0. Feed A1 to X and 0 to Y. Connect X and
Y to the inputs of the adder.
• cycle 2: Compute A2 + 0. Feed A2 to X and 0 to Y.
• cycle 3: Compute A3 + 0. Feed A3 to X and 0 to Y.
• cycle 4: Compute A4 + 0. Feed A4 to X and 0 to Y.
• cycle 5: Compute A1 + A5. Switch the lower switch to feed Z to the
lower input of S1 from now on, and feed A5 to the upper input.
• cycle 6: Compute A2 + A6. Feed A6 to the upper input of S1.
• cycle 7: Compute A3 + A7. Feed A7 to the upper input of S1.
• cycle 8: Compute A4 + A8. Feed A8 to the upper input of S1.
• cycle 9: Compute A1 + A5 + A9. Feed A9 to the upper input of S1.
• cycle 10: Compute A2 + A6 + A10. Feed A10 to the upper input of S1.
• cycle 11: Compute A3 + A7 + A11. Feed A11 to the upper input of S1.
• cycle 12: Compute A4 + A8 + A12. Feed A12 to the upper input of S1.
...
• cycle N-3: Compute A1 + A5 + A9 + ... + A(N-3). Feed A(N-3) to the upper
input of S1.
• cycle N-2: Compute A2 + A6 + A10 + ... + A(N-2). Feed A(N-2) to the
upper input of S1.
• cycle N-1: Compute A3 + A7 + A11 + ... + A(N-1). Feed A(N-1) to the
upper input of S1.
• cycle N: Compute A4 + A8 + A12 + ... + A(N). Feed A(N) to the upper
input of S1.
• cycle N+1: Store Z (= A1 + A5 + ... + A(N-3)) in R and switch the upper
switch to feed R to the upper input of S1 from now on.
• cycle N+2: Compute A1 + A5 + A9 + ... + A(N-3) + A2 + A6 + A10 + ... + A(N-2).
• cycle N+3: Store Z (= A3 + A7 + ... + A(N-1)) in R.
• cycle N+4: Compute A3 + A7 + A11 + ... + A(N-1) + A4 + A8 + A12 + ... + A(N).
• cycle N+5:
• cycle N+6: Store Z (= A1 + A2 + A5 + A6 + ... + A(N-3) + A(N-2)) in R.
• cycle N+7:
• cycle N+8: Compute A1 + A5 + A9 + ... + A(N-3) + A2 + A6 + A10 + ...
+ A(N-2) + A3 + A7 + A11 + ... + A(N-1) + A4 + A8 + A12 + ... + A(N).
• cycle N+9:
• cycle N+10:
• cycle N+11:
• cycle N+12: Result output from Z, which is the sum of all elements of A.
(b) The N values are fed sequentially to a nonpipelined adder; therefore, Nk cycles
are needed. The speedup is
    S4(N) = Nk / (N + 11).
For N = 64 and k = 4,
    S4(64) = (64 x 4)/(64 + 11) = 3.41.
Among the N + 12 cycles, 8 cycles (N+1, N+3, N+5, N+6, N+7, N+9, N+10,
and N+11) issue no useful add instructions. Therefore, N + 3 useful add instructions
are performed. The efficiency is
    eta4(N) = (N + 3)/(N + 11),
    eta4(64) = 67/75 = 0.89.
(c)
    eta4(infinity) = lim (N + 3)/(N + 11) = 1 as N grows without bound.
(d) The half-performance vector length N_1/2 satisfies
    S4(N_1/2) = S4(infinity)/2 = 2.
Solving 4 N_1/2 / (N_1/2 + 11) = 2 gives N_1/2 = 11.
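The speedup and efficiency expressions of parts (b) and (c) are easy to tabulate; the following Python sketch shows how both approach their limits of 4 and 1 as N grows.

    # Speedup and efficiency of the four-stage pipelined summation.

    def speedup(N, k=4):
        return N * k / (N + 11)

    def efficiency(N):
        return (N + 3) / (N + 11)

    for N in (16, 64, 256, 1024):
        print(N, round(speedup(N), 2), round(efficiency(N), 2))
    # speedup(64) = 3.41 and efficiency(64) = 0.89, as computed above.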
Problem 6.9
(a) Forbidden latency: 3; collision vector: (100).
(b) The state transition diagram is shown below:
[State transition diagram omitted.]
(c) Simple cycles: (2), (4), (1,4), (1,1,4), and (2,4); greedy cycles: (2) and (1,1,4).
(d) Optimal constant latency cycle: (2); MAL = 2.
    Throughput = 1/(2 x 20 ns) = 25 MOPS.
Problem 6.10
(a) Forbidden latencies: 3, 4, 5; collision vector: (11100).
(b) The state transition diagram is shown below:
[State transition diagram omitted.]
(c) Simple cycles: (1,1,6), (2,6), (6), and (1,6).
(d) Greedy cycle: (1,1,6).
(e) MAL = (1 + 1 + 6)/3 = 2.67.
(f) Minimum allowed constant latency cycle: (6).
(g) Maximum throughput = 1/(2.67t) = 3/(8t).
(h) 1/(6t).
Problem 6.11 The three pipeline stages are referred to as IF, OF, and EX for instruc-
tion fetch, operand fetch, and execution, respectively. The following diagram shows the
sequence of execution:
[Space-time diagram: the five instructions I1-I5 flowing through the IF, OF, and EX stages.]
At t3, O(I1) intersected with I(I2) = {R0} -> RAW hazard.
At t4, O(I2) intersected with I(I3) = {Acc} -> RAW hazard.
At t5, O(I4) intersected with I(I5) = {Acc} -> RAW hazard.
The following shows a schedule which avoids the hazard conditions:
[Diagram: the instructions I1-I5 scheduled through the IF, OF, and EX stages with delays inserted so that the hazards above are avoided.]
Problem 6.12
(a) For the given value ranges of m and n, we know that mn(N - 1) > N - 1 > N - m.
Now, Eq. 6.32 can be rewritten as
    S(m, n) = [mn(N - 1) + mnk] / [(N - 1) + mnk].
From elementary algebra, we know that the right-hand side of the above equation
attains its largest value when the term mnk is smallest. As a result, the value
of k should be 1 in order to maximize S(m, n).
(b) Instruction-level parallelism limits the growth of the superscalar degree.
(c) The multiphase clocking technique limits the growth of the superpipeline degree.
Problem 6.13
* Solution 1
(a) Reservation table:
(b) Forbidden latency: 4. Collision vector: (1000).
(c) State transition diagram:
(d) Simple cycles: (1,5), (1,1,5), (1,1,1,5), (1,2,5), (1,2,3,5), (1,2,3,2,5), (1,2,3,2,1,5),
(2,5), (2,1,5), (2,1,2,5), (2,1,2,3,5), (2,3,5), (3,5), (3,2,5), (3,2,1,5), (3,2,1,2,5),
(5), (3,2,1,2), and (3).
(e) Greedy cycles: (1,1,1,5) and (1,2,3,2).
(f) MAL = (1 + 1 + 1 + 5)/4 = 2.
(g) Maximum throughput = 1/(2t).
* Solution 2
(a) Reservation table:
(b) Forbidden latency: 2, 4. Collision vector: (1010).
(c) State transition diagram
[State transition diagram omitted.]
(d) Simple cycles: (3), (5), (1,5), and (3,5).
(e) Greedy cycles: (1,5) and (3).
(f) MAL = 3.
(g) Maximum throughput = 1/(3t).
Problem 6.14
(a) The complete reservation table for the composite pipeline is as follows:
[Reservation table omitted: an 11-column table for the composite pipeline.]
(b) Forbidden latencies: 1, 2, 3, 7, 8, 9. Collision vector: (111000111).
(c) State transition diagram:
[State transition diagram omitted.]
(d) Simple cycles: (5), (6), (10), (4,6), (4,10), (5,6), and (5,10). Greedy cycles: (5)
and (4,6).
(e) MAL = 5.
(f) Maximum throughput = 1/(5t).
Problem 6.15
(a) X needs 400 (= 4 x 100) cycles to execute the program; this takes 16000 ns (400 x
40 ns). Y needs 104 (= 5 + (100 - 1)) cycles to execute the program; this takes 5200
ns.
    Speedup = 16000/5200 = 3.08.
(b) The MIPS rates are computed as follows:
    X: 100/16000 ns = 6.25 MIPS,
    Y: 100/5200 ns = 19.2 MIPS.
Problem 6.16
(a) The five-stage multiply pipeline is depicted below:
[Diagram omitted.]
(b) The minimum clock period is tau = tau_m + d = 90 + 20 = 110 ns.
(c) The maximum throughput = 1/(110 ns) = 9.1 MOPS.
Problem 6.17
(a) 1. Exponent subtract
2. Align
3. Fraction add
4. Normalize
(b) From the solution of Problem 6.8, 111 clock cycles are needed.
Problem 6.18
(a) The composite pipeline:
[Diagram omitted.]
(b) Connection of the third adder:
[Diagram omitted.]
Chapter 7
Multiprocessors and Multicomputers
Problem 7.1 Since requests are continually generated by the processors during each
cycle, the bus never becomes idle. The memory requests are uniformly distributed
across all the modules. Thus the probability that a memory module is selected is 1/m
in each cycle. After a memory module is selected, it will be busy for c cycles. Then it
may be reselected or remain idle for a number of cycles. The behavior of a memory
module can be described by the following diagram:
[Diagram: a memory module cycles through an address-latch cycle (1 cycle), c busy cycles, and an idle/waiting period before the next selection.]
The idle or waiting period can be modeled by a random variable x which follows a
geometric distribution. The mean value of x can be computed as follows:
    E[x] = sum over i >= 0 of i (1 - p)^i p.
Let
    f(z) = sum over i >= 0 of (1 - p)^i z^i = 1 / (1 - (1 - p)z).
From the theory of z-transforms, we know that
    E[x] = p [df(z)/dz] evaluated at z = 1.
Let p = 1/m and q = 1 - p, whence
    E[x] = p f'(1) = pq/(1 - q)^2 = q/p = m - 1.
(a) The memory bandwidth delivered by the bus configuration is
    BW = mc / ((c + m) tau).
Using the given values for the variables, we obtain
    BW = (16 x 4) / ((4 + 16) tau) = 3.2/tau words per second,
i.e., 3.2 words per clock cycle.
(b) The fraction of time during which a memory module is busy is
    c / (c + 1 + E[x]) = c / (c + m).
Since there are m independent memory modules, the utilization of all the memory
modules is
    mc / (c + m) = (4 x 16) / (4 + 16) = 3.2
requests per memory cycle.
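The utilization and bandwidth expressions can be evaluated with a short Python sketch; the clock period passed to bandwidth() is an illustrative value, since only m = 16 and c = 4 are fixed by the computation above.

    # Module utilization and delivered bandwidth for the bus system in parts
    # (a) and (b): m modules, c busy cycles per access, mean idle time m - 1.

    def busy_fraction(c, m):
        return c / (c + m)              # busy cycles over (latch + busy + idle)

    def utilization(c, m):
        return m * busy_fraction(c, m)  # expected busy modules per cycle

    def bandwidth(c, m, tau):
        return m * c / ((c + m) * tau)  # words per second for clock period tau

    print(utilization(4, 16))           # 3.2 requests per memory cycle
    print(bandwidth(4, 16, 100e-9))     # e.g. 3.2e7 words/s if tau = 100 ns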
Problem 7.2
(a) The following diagram shows the crossbar network which connects the n processors to
the n memory modules:
[Diagram omitted.]
The complexity of the crossbar network can be estimated as follows. At each
crosspoint, there are 2 AND gates and 2 OR gates. But in the last row (processor n)
and last column (memory module n), we do not need OR gates for the read/write
operation. Therefore, there are 2n^2 AND gates and 2n^2 - 2n OR gates. In practice,
each AND or OR gate in the diagram consists of w two-input AND or OR gates,
as shown in the following diagram.
In total, the number of two-input AND gates is 2n^2 w, and the number of
two-input OR gates is (2n^2 - 2n)w.
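The gate counts scale quadratically with n and linearly with the path width w; the small Python helper below (ours) makes the totals easy to evaluate for any configuration.

    # Two-input gate counts for an n x n crossbar with w-bit data paths,
    # following the expressions derived above.

    def crossbar_gate_counts(n, w):
        and_gates = 2 * n * n * w
        or_gates = (2 * n * n - 2 * n) * w
        return and_gates, or_gates

    print(crossbar_gate_counts(16, 32))   # e.g. 16 x 16 crossbar, 32-bit paths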
(b) The schematic diagram of the arbiter is shown below:
[Diagram omitted.]
In case of conflicting requests to access the same memory module, the arbiter
will grant priority to the processor with the smallest number. There are (n - 1)
two-input AND gates along each column, leading to a total requirement of n(n - 1)
such gates.
Problem 7.3
(a) The mappings are shown in the following figure:
[Diagram: main memory blocks mapped to the cache under (i) direct mapping and (ii) two-way set-associative mapping.]
(b) The results are shown in the following two tables; the first table corresponds to
direct mapping and the second to two-way set-associative mapping. In each table, an
arrow connecting the same block numbers indicates that the corresponding access
takes more than one cycle due to read/write misses or bus contention. In any case,
at most 3 cycles are required to complete an access in the case of a read/write miss
coupled with bus contention. The subscript associated with a block indicates the
state of that block (R for read-only, W for read-write).
[Tables: cycle-by-cycle contents of the four cache block frames of processors P1 and P2 under direct mapping and under two-way set-associative mapping, together with the cache-miss and bus-in-use indications for each reference.]
For the given block reference patterns, the hit ratio is 6/11 for P1 and 7/11 for P2 with either cache organization. The major difference is the contents of the block frames in the caches, due to the different ways of mapping between memory and cache. As can be seen, a memory block can reside in more cache block frames with the set-associative organization, which generally improves the hit ratio.
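Hit-ratio counts of this kind can be reproduced with a small simulator. The sketch below is an illustration only; the block trace, the four frames, and the LRU policy for the set-associative case are hypothetical stand-ins, not the exact setting of Problem 7.3.

def direct_mapped_hits(trace, num_frames=4):
    """Each block maps to frame (block mod num_frames)."""
    frames = [None] * num_frames
    hits = 0
    for blk in trace:
        idx = blk % num_frames
        if frames[idx] == blk:
            hits += 1
        else:
            frames[idx] = blk            # miss: load the block into its frame
    return hits

def two_way_assoc_hits(trace, num_frames=4):
    """Two frames per set; set index = block mod (num_frames // 2); LRU within a set."""
    num_sets = num_frames // 2
    sets = [[] for _ in range(num_sets)]  # each list holds up to 2 blocks, MRU last
    hits = 0
    for blk in trace:
        s = sets[blk % num_sets]
        if blk in s:
            hits += 1
            s.remove(blk)
        elif len(s) == 2:
            s.pop(0)                      # evict the least recently used block
        s.append(blk)                     # (re)insert as most recently used
    return hits

if __name__ == "__main__":
    trace = [0, 0, 1, 2, 0, 4, 0, 2, 6, 2, 0]   # hypothetical block reference trace
    print("direct-mapped hits  :", direct_mapped_hits(trace), "of", len(trace))
    print("2-way set-assoc hits:", two_way_assoc_hits(trace), "of", len(trace))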
Problem 7.4 A valid schedule must satisfy the following two conditions:

(1) It does not violate the dependence relations specified in the diagram.

(2) It does not cause resource conflicts. In other words, no processor or memory module can be allocated to more than one segment at a time.

There are many possible ways to schedule the program segments without violating the above conditions. A systematic approach is to use list scheduling as discussed in [Adam74]. The heuristic is to identify a critical path based on the memory latency and to schedule segments on the critical path first, under the data dependence and resource constraints; a small illustrative sketch of this heuristic follows.
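In the sketch below, the tasks, durations, dependences, and number of processors are hypothetical examples, and the priority of a task is simply the length of its longest path to a sink, in the spirit of the critical-path heuristic; it is an illustration only.

def critical_path_lengths(duration, succs):
    """Longest path (in total duration) from each task to any sink."""
    cp = {}
    def length(t):
        if t not in cp:
            cp[t] = duration[t] + max((length(s) for s in succs.get(t, [])), default=0)
        return cp[t]
    for t in duration:
        length(t)
    return cp

def list_schedule(duration, preds, num_procs):
    """Greedy list scheduling: at each step start the ready task with the
    longest critical path on the first free processor."""
    succs = {}
    for t, ps in preds.items():
        for p in ps:
            succs.setdefault(p, []).append(t)
    prio = critical_path_lengths(duration, succs)
    finish, schedule = {}, {}
    proc_free = [0] * num_procs            # next free time of each processor
    unscheduled = set(duration)
    while unscheduled:
        ready = [t for t in unscheduled if all(p in finish for p in preds.get(t, []))]
        t = max(ready, key=lambda x: prio[x])
        p = min(range(num_procs), key=lambda i: proc_free[i])
        start = max([proc_free[p]] + [finish[q] for q in preds.get(t, [])])
        finish[t] = start + duration[t]
        proc_free[p] = finish[t]
        schedule[t] = (p, start)
        unscheduled.remove(t)
    return schedule, max(finish.values())

if __name__ == "__main__":
    duration = {"a": 3, "b": 2, "c": 2, "d": 4, "e": 1}
    preds = {"c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}
    sched, makespan = list_schedule(duration, preds, num_procs=2)
    print(sched, "makespan =", makespan)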
One processor is demanded the most among all the processors and is busy for 20 time steps. Moreover, none of the memory modules is requested more than 20 times, as shown in the following table:

Memory module | Access frequency
M1            |
M2            |
M3            | 15
M4            | 12
M5            | 8
M6            |
Total         | 70

Therefore, the demand on this processor precludes the possibility of a schedule that can finish the task in less than 20 time steps. In the following table, we show one possible schedule. Each cell of the table contains a pair of numbers x,y, with x corresponding to the instruction and y to the memory module requested. According to condition (2) above, the value of y should be different in cells on the same row.
[Table: one feasible schedule. Rows correspond to time steps 1-21 and columns to processors; each cell holds a pair x,y giving the instruction x and the memory module y it accesses.]
Based on this schedule, 21 time steps are required and the average memory bandwidth is 70/21 = 3.33 words per time step. There are other schedules that yield an identical bandwidth. For instance, the pair 1,4 on row 3 can be moved to the same column on row 1.

Note that in the above schedule an additional condition is satisfied; that is, once a segment is scheduled, it runs in consecutive time steps until completion without being interrupted. If this condition is relaxed, it is possible to obtain a schedule with a total of 20 time steps.
Problem 7.5

(a) Three switch cells are needed to combine the inputs, as shown in the following diagram:

[Figure: three switch cells arranged to combine the incoming Fetch&Add requests.]

Each switch box is able to perform the following functions (see [Stone90], p. 348); a small illustration follows the list:

• Match the addresses on the upper and lower inputs.
• Add the two increments.
• Save one of the increments.
• Match a returning value for Fetch&Add to the saved increment.
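The following is only an illustration of these functions, under simplifying assumptions: one switch cell merges two Fetch&Add requests to the same location by forwarding the sum of the increments, saving one increment, and splitting the returned value on the way back.

def combine_fetch_add(inc_upper, inc_lower):
    """Combine two Fetch&Add requests to the same address arriving on the
    upper and lower inputs; return (forwarded increment, saved increment)."""
    return inc_upper + inc_lower, inc_upper     # save the upper increment

def split_reply(returned_value, saved_increment):
    """On the way back, the memory's old value serves the upper request and
    old value + saved increment serves the lower request."""
    return returned_value, returned_value + saved_increment

if __name__ == "__main__":
    memory = {100: 7}                           # word at a hypothetical address 100
    fwd_inc, saved = combine_fetch_add(inc_upper=2, inc_lower=5)
    old = memory[100]                           # memory performs a single Fetch&Add
    memory[100] = old + fwd_inc
    upper_val, lower_val = split_reply(old, saved)
    print("returned to upper requester:", upper_val)   # 7
    print("returned to lower requester:", lower_val)   # 9
    print("final memory value         :", memory[100]) # 14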
(b) The following figure shows a possible scenario of the data transfer between the processors and a certain memory module (hot spot).

[Figure: a possible scenario of the combined Fetch&Add data transfer between the processors and the hot-spot memory module.]

The final content of the memory location is the same regardless of the serialization of the increments e_i. But the increment saved in the buffer of each switch cell can be different, resulting in different values being returned to different processors.
Problem 7.6 The m-way shuffle of n objects, where n = mk, is defined by the following permutation: an object in position qk + b (with 0 <= q < m and 0 <= b < k) is moved to position bm + q.

... k0, the demand for synchronization lines will exceed m. Therefore, the degree of multiprogramming should not exceed k0.
Problem 7.9 An important property of the multilevel bus/cache architecture is that any memory block which has a copy in the level-1 caches also has a copy in the level-2 cache. This inclusion property makes it possible to use the level-2 caches as filters to avoid unnecessary traffic on the buses.

Consider the use of a write-broadcast protocol to maintain cache coherence of the system in Fig. 7.3. When a level-1 cache writes to a memory block, the updated value is broadcast on the intracluster bus so that the other caches which have a copy of the memory block will update their data.

The updated value is also propagated up to C20, which updates its copy of the memory block. C20 then broadcasts the new value on the intercluster bus. If a copy of the block exists in another level-2 cache (C21, for instance), its value is updated. By the inclusion property, the memory block is likely to be resident in the cluster underneath C21. Therefore, the data is also passed down to the intracluster bus, and the level-1 caches which have a copy of the memory block also update their values.

The relative merits of write-invalidate and write-broadcast protocols have been studied extensively, through either simulation or analytic approaches [Archibald86, Yang89]. See the discussion in the solution for Problem 7.19 below. Most comparisons have been conducted on single-level caches, but the results should be applicable to hierarchical caches. Write-broadcast protocols generally exhibit better performance, although actual performance depends on the memory reference patterns. An advantage of write-invalidate protocols is the relatively simple hardware support required.
Problem 7.10

(a) The general trend in the industry is toward open systems, which favor commercially available processors over proprietary designs. This helps reduce the cost and shorten the time of development. More effort can be focused on high-level design such as the interconnection structure and software development.

(b) Increasing scalability is the main motivation.

(c) To avoid the problems of memory contention and/or cache inconsistency (if private memory or cache is used).

(d) To offer more flexibility in using existing multiprocessor software.
Problem 7.11

(a) A message is the logical unit for internode communication. It is often assembled from an arbitrary number of fixed-length packets and may have a variable length.

A packet is the basic unit of information transmission; it contains the destination address for routing purposes.

A flit (flow control digit) is the smallest unit of information that a queue or channel can accept or refuse.

(b) In a store-and-forward network, the basic unit of information flow is a packet. Each node has a packet buffer. A packet is transmitted from a source node to a destination node through a sequence of intermediate nodes. When a packet reaches an intermediate node, it is first stored in the buffer. Then it is forwarded to the next node if the desired output channel and a packet buffer in the receiving node are both available.

(c) In the wormhole routing scheme, a flit is the basic unit of information flow. Flit buffers are used in the hardware routers attached to the nodes. The transmission from the source node to the destination node is done through a sequence of routers. All the flits in the same packet are transmitted in order, as inseparable companions, in a pipelined fashion. Only the header flit knows where the packet is going; all the data flits must follow the header flit. Different packets can be interleaved during transmission. However, the flits from different packets cannot be mixed up.

(d) A virtual channel is a logical link between two nodes. It is formed with a flit buffer in the source node, a physical channel between them, and a flit buffer in the receiver node. There can be more than one virtual channel between two nodes; however, a smaller number of physical channels is time-shared by all the virtual channels.

(e) Buffer deadlocks may occur with store-and-forward routing, in which no buffers are provided on the channels. A deadlock situation occurs when there is a circular wait among the nodes and the buffers in the nodes are all full. Channel deadlock can occur with wormhole routing when the channels used by different messages enter a circular wait. Both types of deadlock are illustrated in Example 7.2.

(f) When two packets reach the same node and request the same outgoing channel, the cut-through routing scheme uses a packet buffer to temporarily store one of the received packets. When the channel becomes available later, the stored packet is then transmitted.

(g) When two packets reach the same node and request the same outgoing channel, the blocking policy blocks the second packet from advancing. However, the packet is not abandoned.

(h) When two packets reach the same node and request the same outgoing channel, the discard policy simply drops the packet being blocked. Packet retransmission is required when the channel becomes available later.

(i) In detour flow control, the blocked packet is rerouted to a detour channel. From there, another route may be found to reach the destination node.

(j) • A virtual network is a network in which all nodes are connected by virtual channels. Since there are multiple virtual channels between two nodes, several virtual networks can be formed.
• The nodes in a network can be subdivided into several subsets. The nodes in a subset and their connections form a subnetwork of the original network.
Problem 7.12

(a) A 16 x 16 Omega network using 2 x 2 switches is shown below:

[Figure: 16 x 16 Omega network built from four stages of 2 x 2 switches, with the connections 1011 -> 0101 and 0111 -> 1001 highlighted.]
(b) The connections 1011 -> 0101 and 0111 -> 1001 are indicated in the above diagram. As can be seen, there is no blocking between the two connections.

(c) Each switch box can implement two permutations in one pass (straight or crossed). There are (16/2) x log2 16 = 32 switch boxes. Therefore, the total number of single-pass permutations is

$$2^{(16/2)\log_2 16} = 2^{32}.$$

The total number of permutations of 16 inputs is 16!, therefore

$$\frac{\text{number of single-pass permutations}}{\text{total number of permutations}} = \frac{2^{32}}{16!} \approx 2.05 \times 10^{-4}.$$

(d) At most log2 16 = 4 passes are needed to realize all permutations.
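Single-pass routing in an Omega network is self-routing on the destination tag, which also makes conflict checks easy to automate. The sketch below is an illustration only, assuming the usual shuffle-then-switch construction with the switch at stage k set by bit (3 - k) of the destination address; it can be used to confirm that the two connections of part (b) do not block each other.

def omega_path(src, dst, bits=4):
    """Destination-tag routing in a 2x2-switch Omega network with 2**bits ports.
    Returns the (stage, switch, output-port) triples used by the connection."""
    pos, path = src, []
    for stage in range(bits):
        pos = ((pos << 1) | (pos >> (bits - 1))) & ((1 << bits) - 1)  # perfect shuffle
        out = (dst >> (bits - 1 - stage)) & 1        # switch set by this destination bit
        switch = pos >> 1
        pos = (switch << 1) | out                    # leave on the selected output
        path.append((stage, switch, out))
    return path

def blocks(conn_a, conn_b):
    """Two connections block each other if they need the same switch output link."""
    return any(x == y for x, y in zip(omega_path(*conn_a), omega_path(*conn_b)))

if __name__ == "__main__":
    a = (0b1011, 0b0101)     # the two connections of part (b)
    b = (0b0111, 0b1001)
    print("conflict:", blocks(a, b))   # expected: False (no blocking)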
Problem 7.13

(a) A unicast pattern is a one-to-one communication, and a multicast pattern is a one-to-many communication.

(b) A broadcast pattern is a one-to-all communication, and a conference pattern is a many-to-many communication.

(c) The channel traffic at any time instant is indicated by the number of channels used to deliver the message involved.

(d) The communication latency is indicated by the longest packet transmission time involved.

(e) Partitioning of a physical network into several logical subnetworks. In each of the subnetworks, appropriate routing schemes can be used to avoid deadlock.
Problem 7.14

(a) (101101) -> (101100) -> (101110) -> (101010) -> (111010) -> (011010).

(b) Two optimal routing schemes under different constraints:

• Routing with a minimum number of channels:

[Figure: multicast tree using a minimum number of channels.]

For this routing, traffic = 20, distance = 9.

• Routing with a minimum distance from the source to each destination:

[Figure: multicast tree with minimum distance from the source to each destination.]

For this routing, traffic = 22, distance = 8.

There are other routes with the same traffic and distance.

(c) The routing is shown in the following tree:

[Figure: routing tree for part (c).]

The paths are shown by heavy lines in the following diagram:

[Figure: the corresponding paths marked by heavy lines in the network.]
Problem 7.15

(a) In a hypercube of dimension n, we denote a node as n_k, where k is an n-digit binary number. Node n_k has n output channels, one for each dimension, labeled c_{0k}, c_{1k}, ..., c_{(n-1)k}. The E-cube algorithm routes in increasing order of dimension. A message arriving at node n_k destined for node n_l is routed on channel c_{ik}, where i is the position of the least significant bit in which k and l differ. Since messages are routed in order of increasing dimension, and hence increasing channel subscripts, there are no cycles in the channel dependency graph and E-cube routing is deadlock-free.
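The channel ordering can be made concrete with a small routine. The sketch below is an illustration only: it performs E-cube routing on an n-cube by flipping the differing address bits from the least significant dimension upward, which is the ascending-channel order used in the argument above.

def ecube_route(src, dst, n):
    """E-cube routing on an n-dimensional hypercube.
    Returns the list of (dimension, next_node) hops from src to dst."""
    hops, node = [], src
    for dim in range(n):                 # dimensions in increasing order
        if (node ^ dst) & (1 << dim):    # addresses differ in this bit
            node ^= (1 << dim)           # traverse the channel in this dimension
            hops.append((dim, node))
    return hops

if __name__ == "__main__":
    # route 101101 -> 011010 on a 6-cube (the path listed in Problem 7.14a)
    for dim, node in ecube_route(0b101101, 0b011010, n=6):
        print(f"dimension {dim} -> node {node:06b}")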
(b) There are four possible X-Y routing patterns, corresponding to the east-north, east-south, west-north, and west-south paths chosen. As in the 3 x 3 mesh shown in Figure 7.37, we can have two pairs of virtual channels in the Y-dimension. For each of the four routing patterns, no cycle will be formed in the channel dependency graph. Thus, the X-Y routing is deadlock-free.
(c) In a k-ary n-cube, we denote the address of a node by n_j, where j is an n-digit radix-k number. The ith digit of j represents the node's position in dimension i. For example, the center node in the 3-ary 2-cube below is n_11. A channel is identified by the address of its source node and the dimension it is in. For example, the dimension-0 (horizontal) channel from n_11 to n_10 is c_{011}.

To break cycles we divide each channel into an upper and a lower virtual channel. The upper virtual channel of c_{011} is labeled c_{0111}, and the lower virtual channel is labeled c_{0011}. In general, virtual channel subscripts are of the form dvx, where d is the dimension, v selects the virtual channel, and x identifies the source node of the channel. To assure that a routing is deadlock-free, we restrict it to routing through channels in order of ascending subscripts.

As in the E-cube algorithm, we route messages in increasing order of the dimensions, starting with the lowest dimension. In each dimension i, a message is routed in that dimension until it reaches a node whose subscript matches the destination address in the ith position. The message is routed on the upper channel if the ith digit of the destination address is greater than the ith digit of the current node's address; otherwise, the message is routed on the lower channel. This algorithm routes messages in order of ascending subscripts. Thus, it is deadlock-free.
[Figure: a 3-ary 2-cube with its nodes and channels labeled.]
Problem 7.16

(a) The turn model works by prohibiting a minimum number of turns (changes of direction by 90 degrees) to prevent the formation of cycles. With the cycles broken, circular waits are removed and deadlocks are prevented. Formally, routing algorithms developed under this model allow a channel-numbering scheme in which the channels traveled by each packet either increase or decrease monotonically. This type of routing has been shown to be deadlock-free. See also the solution for Problem 7.15.

(b) The authors describe three different routing algorithms for use with n-dimensional meshes: all-but-one-negative-first, all-but-one-positive-last, and negative-first. These algorithms specify that a packet should use outgoing channels along certain directions before (or after) the others. As stated in (a), the algorithms have to be used in conjunction with special channel-numbering schemes. For more details, see the paper.

(c) A k-ary n-cube uses a torus connection along each dimension; i.e., each node at the edge of a mesh has a wraparound connection. One way to use the algorithms developed for meshes is to assign to each wraparound channel a number greater (or smaller) than that of any other channel along that direction in the mesh, depending on the routing algorithm used.
Problem 7.17

(a) In multicast, the objectives are twofold. One is to send a message to all the destination nodes, and the other is to do so efficiently. A tree can be constructed and used to determine the minimum subtree which covers all the destination nodes. This is illustrated in the following diagram using the multicast pattern in Example 7.8.

[Figure: tree covering the multicast destinations, with destinations enclosed in boxes and the chosen paths drawn in heavy lines.]

The destinations are enclosed in boxes. To cover all the destinations from the source with a minimum number of edges (lowest traffic), the paths indicated by heavy lines are chosen, which are identical to the choice of the greedy algorithm in Example 7.8. The path has a latency of 4 and a traffic of 10. Note that there are alternatives to some of the nodes/edges selected. For instance, 1001 can be used instead of 1111, and destination 1010 can be reached from 1011 instead of 1110.

(b) The greedy multicast algorithm provides a strategy to deterministically select intermediate nodes (called forward nodes in the paper) between the source and the destinations. The selection is based on the distance between the addresses of the source s and a destination d_j, which is the number of 1s in r_j = s XOR d_j, where XOR stands for the bitwise exclusive-OR operation.

The design of the algorithm is such that each intermediate node on the path from s to d_j reduces the number of 1s in r_j by 1. In fact, the descendant nodes of each intermediate node are chosen according to the number of destination nodes for which this goal is achieved. Therefore, if initially s and d_j differ in b bit positions, the message will arrive at d_j in b steps, which is the minimum possible number of steps on a hypercube.

The authors proved that the greedy algorithm also minimizes network traffic if the number of destinations is 1 or 2, but is slightly inferior to the optimal algorithm when the number is larger than 2.
Problem 7.18 In the write-once protocol, a block may exist in one of four states in a cache:

• Invalid: there is no copy of the block in the cache.
• Valid: an arbitrary number of caches can have this read-only block, and all the copies are identical.
• Reserved: data in the block has been locally modified exactly once since it was brought into the cache, and shared memory has been updated.
• Dirty: data in the block has been locally modified more than once since it was brought into the cache, and the copy in shared memory is stale.

The write-once protocol is mainly characterized by the introduction of the Reserved state. A first-time write to a clean and potentially shared block results in a write-through to memory; it updates the main-memory copy as well as the local copy. The local copy becomes Reserved, which indicates an exclusive copy in the system and saves subsequent write invalidations.

Each cache has a bus watcher which monitors the transactions on the bus. When the bus watcher detects an address on the bus which hits in the local cache with a dirty copy, it intervenes in the bus transaction by asserting the memory-bypass signal to inhibit the memory from supplying the data. To allow rapid access to the address tags and state bits concurrently with accesses to the address tags by the CPU, dual (identical) cache directories are used.
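The local write transitions described above can be summarized in a small table. The sketch below is a simplified illustration of the write-once states for processor write hits only (bus-induced transitions and misses are omitted), not a complete protocol specification.

# Write-once cache block states
INVALID, VALID, RESERVED, DIRTY = "Invalid", "Valid", "Reserved", "Dirty"

def next_state_on_write_hit(state):
    """State change on a processor write hit, per the description above:
       Valid    -> Reserved  (first write: written through to memory)
       Reserved -> Dirty     (second write: local only, memory becomes stale)
       Dirty    -> Dirty     (further writes stay local)."""
    if state == VALID:
        return RESERVED
    if state in (RESERVED, DIRTY):
        return DIRTY
    raise ValueError("a write to an Invalid block is a miss, not modeled here")

if __name__ == "__main__":
    s = VALID
    for i in range(3):
        s = next_state_on_write_hit(s)
        print(f"after write {i + 1}: {s}")   # Reserved, Dirty, Dirty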
Problem 7.19

There are five states for cached blocks in the Dragon protocol: Invalid, Valid-Exclusive (only cached copy in the system; clean and identical with the memory copy), Shared-Clean, Shared-Dirty (write-back required at replacement), and Dirty (only copy in the caches, and modified).

The Dragon protocol is a write-broadcast protocol, like the Firefly protocol. As long as there exists more than one cached copy, writes are broadcast to the other caches. One difference is that updates to shared blocks are also immediately reflected in main memory in the Firefly protocol, while the Dragon protocol introduces the Shared-Dirty state so that the memory copy is updated only when the Shared-Dirty copy is replaced. The cache that performed the latest write to the shared block is in the Shared-Dirty state and is responsible for supplying the block on misses in remote caches and for updating main memory on replacement.

In the case of write hits on unmodified private blocks, the Dragon and the Firefly are able to eliminate unnecessary overhead by changing the cache state from Valid-Exclusive to Dirty without inducing any bus transaction. In contrast, the write-once protocol requires a single word to be written to main memory.

The distributed write protocols of Dragon and Firefly yield better performance than the write-invalidation of the write-once protocol in the handling of shared data. This is because the overhead of distributing written data to all caches having a copy is lower than repeatedly invalidating all other copies and subsequently forcing misses on the next references in those caches where the block was invalidated.

The performance of the Dragon can slightly exceed that of the Firefly protocol because the Firefly broadcasts writes to main memory as well as to the other caches. Therefore, the performance of the Firefly may be affected by the long latency of the memory system. But the Dragon gains this performance at the cost of adding one more state, Shared-Dirty, and it becomes more complex compared to the simplicity of the write-once protocol.
Problem 7.20

(a) When more than one input of a crossbar module wants to use the same output port, the output connection is granted to the input port with the smallest number. In the Cedar implementation, there is priority-resolution logic in each output port. An arriving packet waits in the input queue if the output port is already busy or the input is not chosen by the resolution logic. Only when all currently conflicting requests have been resolved will any new request be allowed to enter the arbitration. In this fashion, high-priority input ports are prevented from starving low-priority ones. In summary, a combination of the first-come, first-served queueing principle and a fixed priority based on the input port number is used to resolve conflicts. See [Konicek91].

(b) See Fig. 7.10a in the text for a similar connection of a 64 x 64 network using 8 x 8 switch modules.

(c) See Fig. 7.10b in the text for a similar connection of a 512 x 512 network using 8 x 8 switch modules.

Chapter 8
Multivector and SIMD Computers
Problem 8.1

(a) In the register-to-register architecture, operands and results are retrieved indirectly from the main memory through the use of a large number of vector or scalar registers. In the memory-to-memory architecture, source operands, intermediate results, and final results are retrieved directly from the main memory. More registers are needed in a register-to-register architecture, and higher memory bandwidth is needed in the memory-to-memory architecture.

(b) An SIMD machine with n processors and a pipelined machine with n stages and a 1/n clock period have the same performance (n results every basic cycle). However, the SIMD machine needs n times the hardware (ALUs), and the pipelined machine needs n times the memory bandwidth.
Problem 8.2

(a) The percentage of vector code in a program required to achieve equal utilization of the vector and scalar hardware.

(b) The percentage of code in a program which can be vectorized.

(c) A compiler capable of vectorization.

(d) The instructions correspond to mappings of the following forms:

f1 : V_i -> V_j
f2 : V_i -> s_j
f3 : s x V_i -> V_j
(e) A gather instruction fetches the nonzero elements of a sparse vector using indices; it corresponds to the mapping

f : M -> V1 x V0.

A scatter instruction stores a vector into a sparse vector whose nonzero entries are indexed; it corresponds to the mapping

f : V1 x V0 -> M.

(f) A sparse matrix is a matrix in which most of the entries are zero. A masking instruction uses a mask vector to compress or expand a vector to a shorter or longer index vector, respectively, corresponding to the mapping

f : V x V_m -> V.
Problem 8.3

(a) The low-order interleaved memory can be rearranged to allow simultaneous access, or S-access, as illustrated in Fig. 8.1a. In this case, all memory modules are accessed simultaneously in a synchronized manner. Again, the high-order (n - a) bits select the same offset word from each module.

At the end of each memory cycle (Fig. 8.1a), m = 2^a consecutive words are latched in the data buffers simultaneously. The low-order a bits are then used to multiplex the m words out, one per minor cycle. If the minor cycle (tau) is chosen to be 1/m of the major memory cycle (theta), then it takes two memory cycles to access m consecutive words.

However, if the access phase of the last access is overlapped with the fetch phase of the current access (Fig. 8.1b), effectively m words take only one memory cycle to access. If the stride is greater than 1, the throughput decreases roughly in proportion to the stride.

(b) The m-way low-order interleaved memory structure shown in Figs. 8.2a and 8.3a allows m memory words to be accessed concurrently in an overlapped manner. This concurrent access has been called C-access, as illustrated in Fig. 8.3b.

The access cycles in different memory modules are staggered. The low-order a bits select the modules, and the high-order b bits select the word within each module, where m = 2^a and a + b = n is the address length.

To access a vector with a stride of 1, successive addresses are latched in the address buffer at the rate of one per cycle. Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle (theta), as shown in Fig. 8.3b. If the stride is 2, successive accesses must be separated by two minor cycles in order to avoid access conflicts. This reduces the memory throughput by one-half. If the stride is 3, there is no module conflict and the maximum throughput (m words) results. In general, C-access will yield the maximum throughput of m words per memory cycle if the stride is relatively prime to m, the number of interleaved memory modules.
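The stride dependence of C-access throughput can be illustrated with a short calculation. The sketch below is an illustration only: the number of distinct modules touched by a stride-s access stream is m/gcd(s, m), so the throughput relative to the stride-1 case is 1/gcd(s, m), and it is maximal exactly when the stride is relatively prime to m.

from math import gcd

def distinct_modules(stride, m):
    """Number of distinct memory modules touched by addresses 0, s, 2s, ... (mod m)."""
    return m // gcd(stride, m)

def relative_throughput(stride, m):
    """C-access throughput relative to the stride-1 case (m words per memory cycle)."""
    return 1.0 / gcd(stride, m)

if __name__ == "__main__":
    m = 8                                    # eight-way interleaving, as in Fig. 8.3
    for s in (1, 2, 3, 4, 8):
        print(f"stride {s}: {distinct_modules(s, m)} modules, "
              f"throughput x{relative_throughput(s, m):.3f}")
    # stride 2 halves the throughput; stride 3 (relatively prime to 8) keeps it maximal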
(c) A memory organization in which C-access and S-access are combined is called C/S-access. This scheme is shown in Fig. 8.4, where n access buses are used with m interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to allow C-access. The n buses operate in parallel to allow S-access.
Figure 8.1 The S-access interleaved memory for vector operand access: (a) S-access organization for an m-way interleaved memory; (b) successive vector accesses using overlapped fetch and access cycles.

Figure 8.2 Two interleaved memory organizations with m = 2^a modules and w = 2^b words per module (word addresses shown in boxes): (a) low-order m-way interleaving; (b) high-order m-way interleaving.

Figure 8.3 Multiway interleaved memory organization and the C-access timing chart: (a) eight-way low-order interleaving (absolute address shown in each memory word); (b) pipelined access of eight consecutive words in a C-access memory (theta = major cycle, tau = theta/m = minor cycle, m = degree of interleaving).
In each memory cycle, at most m x n words are fetched if the n buses are fully used with pipelined memory accesses.

Figure 8.4 The C/S memory organization. (Courtesy of D. K. Panda, 1990)

The C/S-access memory is suitable for use in vector multiprocessor configurations. It provides parallel, pipelined access of vector data sets with high bandwidth. A special vector cache design is needed within each processor in order to guarantee smooth data movement between the memory and the multiple vector processors.
Problem 8.4 The comparison is summarized in the following table:

Class                                              | Architecture                                                  | Performance  | Cost
Full-scale supercomputers                          | multiprocessor, multiple vector pipelines, pipeline chaining  | > 1 Gflops   | $2 - 25 million
High-end mainframes or near-supercomputers         | attached vector processor                                     | > 200 Mflops | $1 - 7 million
Minisupercomputers or supercomputing workstations  | multicomputer                                                 | > 100 Mflops | $0.1 - 1.5 million
Problem 8.5

(a) A composite function of vector operations converted from a looping structure of linked scalar operations.

(b) • The program construct for processing long vectors is called a vector loop. When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. One segment is processed at a time.
• Pipeline chaining links vector operations following a linear dataflow pattern. Vector registers are used as interfaces between functional pipelines. Continuous data flow is maintained in successive pipelines.

(c) A synchronous program graph in which all nodes have zero delay.

(d) A pipenet is constructed by interconnecting multiple functional pipelines through two buffered crossbar networks which are themselves pipelined.
Problem 8.6

(a) Figure 8.5 shows the CM-2 processor chips with memory and floating-point chips. Each data processing node contains 32 bit-slice data processors, an optional floating-point accelerator, and interfaces for interprocessor communication. Each data processor is implemented with a 3-input, 2-output bit-slice ALU and associated latches and memory interface. This ALU can perform bit-serial full-adder and Boolean logic operations.

The processor chips are paired in each node, sharing a group of memory chips. Each processor chip contains 16 processors. The parallel instruction set, called Paris, includes nanoinstructions for memory load and store, arithmetic and logic, control of the router, NEWS grid, and hypercube interface, floating-point, I/O, and diagnostic operations.

The memory data path is 22 bits (16 data and 6 ECC) per processor chip. The 18-bit memory address allows 2^18 = 256K memory words (512 Kbytes of data) shared by 32 processors. The floating-point chip handles 32-bit operations at a time. Intermediate computational results can be stored back into the memory for subsequent use. Note that integer arithmetic is carried out directly by the processors in a bit-serial fashion.

(b) • Special hardware is built on each processor chip for data routing among the processors. The router nodes on all processor chips are wired together to form a Boolean n-cube. A full configuration of the CM-2 has 4096 router nodes on processor chips interconnected as a 12-dimensional hypercube.

Each router node is connected to 12 other router nodes, including its paired node (Fig. 8.5). All 16 processors belonging to the same node are equally capable of sending a message from one vertex to any other processor at another vertex of the 12-cube. The following example clarifies this message-passing concept.

On each vertex of the 12-cube, the processors are numbered 0 through 15. The hypercube routers are numbered 0 through 4095 at the 4096 vertices. Processor 5 on router node 7 is thus identified as the 117th processor in the entire system, because 16 x 7 + 5 = 117.
Figure 8.5 A CM-2 processing node consisting of two processor chips and some memory and floating-point chips. (Courtesy of Thinking Machines Corporation, 1990)
Suppose processor 117 wants to send a message to processor 361, which is located at processor 9 on router node 22 (16 x 22 + 9 = 361). Since router node 7 = (000000000111)_2 and router node 22 = (000000010110)_2, they differ in dimension 0 and dimension 4.

This message must traverse dimensions 0 and 4 to reach its destination. From router node 7, the message is first directed to router node 6 = (000000000110)_2 through dimension 0 and then to router node 22 through dimension 4, if there is no contention for hypercube wires. On the other hand, if router 7 has another message using the dimension-0 wire, the message can be routed first through dimension 4 to router 23 = (000000010111)_2 and then to the final destination through dimension 0 to avoid channel conflict.

• Within each processor chip, the 16 physical processors can be arranged as an 8 x 2, 1 x 16, 4 x 4, 4 x 2 x 2, or 2 x 2 x 2 x 2 grid, and so on. Sixty-four virtual processors can be assigned to each physical processor. These 64 virtual processors can be envisioned as forming an 8 x 8 grid within the chip.

The NEWS grid name reflects the fact that each processor has a north, east, west, and south neighbor in the various grid configurations. Furthermore, a subset of the hypercube wires can be chosen to connect the 2^12 nodes (chips) as a two-dimensional grid of any shape. For instance, 64 x 64 is one of the possible grid configurations.

Coupling the internal grid configuration within each node with the global grid configuration, one can arrange the processors in NEWS grids of any shape involving any number of dimensions. This flexible interconnection among the processors makes the machine very attractive for routing data on dedicated grid configurations based on the application requirements.

(c) Besides dynamic reconfiguration in NEWS grids through the hypercube routers, the CM-2 has special built-in hardware support for scanning or spreading across the NEWS grids. These are very powerful parallel operations for fast data combining or spreading throughout the entire array.

Scanning on NEWS grids combines communication and computation. The operation can simultaneously scan every row of a grid along a particular dimension for the partial sums of that row, for the largest or smallest value, or for the bitwise OR, AND, or exclusive OR. Scanning operations can be expanded to cover all elements of an array.

Spreading can send a value to all other processors across the chips. A single-bit value can be spread from one chip to all other chips along the hypercube wires in only 75 steps. Variants of scans and spreads have been built into the Paris instructions for ease of access.

(d) • In broadcasting, copies of a single item are sent to all processors. In the CM-2, this is carried out through the broadcast bus to all data processors at once.
• Global combining allows the front end to obtain the sum, largest value, logical OR, etc., of values, one from each processor.
• Data-parallel programming provides the high-level programmer with the illusion of as many processors as necessary; one programs as if there were a processor for every data element to be processed. These are often described as virtual processors.
Problem 8.7

(a) The X-Net interconnect directly connects each PE with its eight neighbors in the two-dimensional mesh. Each PE has 4 connections at its diagonal corners, forming an X pattern, similar to the BLITZEN X-grid network (Davis and Reif, 1986). A tri-state node at each X intersection permits communication with any of the 8 neighbors using only 4 wires per PE.

The connections at the PE array edges are wrapped around to form a two-dimensional torus. The torus structure is symmetric, facilitates several important matrix algorithms, and can emulate a one-dimensional ring with two X-Net steps. The aggregate X-Net communication bandwidth is 18 Gbytes/s in the largest MP-1 configuration.

(b) The network provides global communication between all PEs and forms the basis for the MP-1 I/O system. The three router stages implement the function of a 1024 x 1024 crossbar switch. Three router chips are used on each processor board.

Each PE cluster shares an originating port connected to router stage S1 and a target port connected to router stage S3. Connections are established from an originating PE through stages S1, S2, and S3, and then to the target PE. The full MP-1 configuration has 1024 PE clusters, so each stage has 1024 router ports. The router supports up to 1024 simultaneous connections with an aggregate bandwidth of 1.3 Gbytes/s.

(c) 1. Each PE has a 4-bit integer ALU, a 1-bit logic unit, a 64-bit mantissa unit, a 16-bit exponent unit, and a flag unit. All these functional units can be active simultaneously.
2. The PE array communicates with the parallel disk array through the high-speed I/O system, which is essentially implemented by the 1.3 Gbytes/s global router network.
Problem 8.8

(a) A fat tree is more like a real tree in that it gets thicker as one moves from the leaves toward the root. Processing nodes, control processors, and I/O channels are located at the leaves of the fat tree. A binary fat tree is illustrated in Fig. 8.6. The internal nodes are switches. Unlike in an ordinary binary tree, the channel capacities of a fat tree increase as we ascend from the leaves to the root.

Figure 8.6 Binary fat tree.

The hierarchical nature of a fat tree can be exploited to give each user partition a dedicated subtree, which cannot be interfered with by any other partition's message traffic. The CM-5 data network actually implements a 4-ary fat tree, as shown in Fig. 8.7. Each of the internal switch nodes is made up of several router chips. Each router chip is connected to 4 child chips and either 2 or 4 parent chips.

Figure 8.7 CM-5 data network implemented with a 4-ary fat tree. (Courtesy of Leiserson et al., Thinking Machines Corporation, 1992)

To implement the partitions, one can allocate different subtrees to handle different partitions. The size of the subtrees varies with the partition demands. The I/O channels are assigned to another subtree, which is not devoted to any user partition; the I/O subtree is accessed as a shared system resource. In many ways, the data network functions like a hierarchical system bus, except with no interference among the partitioned subtrees. All leaf nodes have unique physical addresses.

(b) The fat tree can be subdivided into several subtrees. Each subtree is assigned to a user partition. Each partition consists of a control processor, a collection of processing nodes, and dedicated portions of the data and control networks.

(c) As shown in Fig. 8.8, the basic control processor consists of a RISC microprocessor (CPU), a memory subsystem, I/O with local disks and Ethernet connections, and a CM-5 network interface. This is equivalent to a standard off-the-shelf workstation-class computer system. The network interface connects the control processor to the rest of the system through the control network and the data network.

Each control processor runs CMOST, a UNIX-based OS with extensions for managing the parallel processing resources of the CM-5. Some control processors are used to manage computational resources in user partitions; others are used to manage I/O resources. Control processors are specialized for managerial functions rather than computational functions. For this reason, high-performance arithmetic accelerators are not needed. Instead, additional I/O connections are more useful in control processors.

(d) As illustrated in Fig. 8.10a, vector units can be added between the memory banks and the system bus as an optional feature. The vector units replace the memory controller in Fig. 8.9. Each vector unit has a dedicated 72-bit path to its attached memory bank, providing a peak memory bandwidth of 128 Mbytes/s per vector unit.

The vector units execute vector instructions issued to them by the scalar microprocessor and perform all functions of a memory controller, including generation and checking of ECC (error-correcting code) bits.
Figure 8.8 The control processor in CM-5. (Courtesy of Thinking Machines Corporation, 1992)

Figure 8.9 The processing node in CM-5. (Courtesy of Thinking Machines Corporation, 1992)

Figure 8.10 The processing node with vector units in CM-5: (a) processing node with vector units; (b) vector unit functional architecture. (Courtesy of Thinking Machines Corporation, 1992)
As detailed in Fig. 8.10b, each vector unit has a vector instruction decoder, a pipelined ALU, and 64 64-bit registers, like a conventional vector processor.

Each vector instruction may be issued to a specific vector unit or pair of units, or broadcast to all four units at once. The scalar microprocessor takes care of address translation and loop control, overlapping them with vector unit operations. Together, the vector units provide 512 Mbytes/s of memory bandwidth and 128 Mflops of 64-bit peak performance per node.

In this sense, each processing node of the CM-5 is itself a supercomputer. Collectively, 16K processing nodes can yield a peak performance of 2^14 x 2^7 = 2^21 Mflops = 2 Tflops.

Initially, SPARC microprocessors are used in implementing the control processors and the processing nodes. As processor technology advances, other new processors may also be incorporated in the future. The network architecture is designed to be independent of the processors chosen, except for the network interfaces, which may need minor modifications when new processors are used.
Problem 8.9

(a) An example of replication: if A and B are arrays and X is a scalar quantity, the statement A = B + X implicitly broadcasts X to all processors so that the value of X can be added to every element of B.

(b) Besides sum-reduction, other important reduction operations include taking the maximum or minimum, logical AND, and logical OR. The following are examples of maximum-reduction and minimum-reduction, applied to the rows of a 4 x 4 array:

Maximum Reduction
1 2 3 4        4
1 0 0 1   ->   1
6 5 6 2        6
4 2 4 5        5

Minimum Reduction
1 2 3 4        1
1 0 0 1   ->   0
6 5 6 2        2
4 2 4 5        2

(c) Transposing a matrix, reversing a vector, shifting a multidimensional grid, and FFT butterfly patterns are all examples of permutation. Here is an example of matrix transposition:

Matrix Transposition
1 2 3 4        1 1 6 4
1 0 0 1   ->   2 0 5 2
6 5 6 2        3 0 6 4
4 2 4 5        4 1 2 5

(d) The following are examples of maximum-prefix and minimum-prefix, computed along each row:

Maximum Prefix
1 2 3 4        1 2 3 4
1 0 0 1   ->   1 1 1 1
6 5 6 2        6 6 6 6
4 2 4 5        4 4 4 5

Minimum Prefix
1 2 3 4        1 1 1 1
1 0 0 1   ->   1 0 0 0
6 5 6 2        6 5 5 2
4 2 4 5        4 2 2 2
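The row-wise reductions and prefix (scan) operations shown above are easy to express in a few lines. The sketch below is an illustration only; it reproduces the example matrix and its reduction, prefix, and transposition results sequentially, whereas a data-parallel machine would process all rows at once.

from itertools import accumulate

M = [
    [1, 2, 3, 4],
    [1, 0, 0, 1],
    [6, 5, 6, 2],
    [4, 2, 4, 5],
]

def reduce_rows(matrix, op):
    """Row-wise reduction, e.g. op=max or op=min."""
    return [op(row) for row in matrix]

def prefix_rows(matrix, op):
    """Row-wise prefix (scan), e.g. running maximum or running minimum."""
    return [list(accumulate(row, op)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

if __name__ == "__main__":
    print("max-reduction:", reduce_rows(M, max))    # [4, 1, 6, 5]
    print("min-reduction:", reduce_rows(M, min))    # [1, 0, 2, 2]
    print("max-prefix   :", prefix_rows(M, max))
    print("min-prefix   :", prefix_rows(M, min))
    print("transpose    :", transpose(M))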
Problem 8.10

(a) Pipeline chaining for CVF execution:

[Figure: functional pipelines chained through vector registers for the CVF.]

(b) Space-time diagram:

[Figure: space-time diagram of the chained pipeline execution.]
Problem 8.11

(a) The 11 vector instructions needed to perform the given CVFs on the Cray X-MP are as follows:

M(B : B+63) -> V1
M(C : C+63) -> V2
s x V2 -> V3
V1 + V3 -> V4
V4 -> M(A : A+63)
s x V1 -> V5
V5 x V2 -> V6
V6 -> M(D : D+63)
V2 - V1 -> V7
V4 x V7 -> V8
V8 -> M(E : E+63)

(b) Space-time diagram for the execution of the CVF code:

[Figure: space-time diagram showing the loads, chained arithmetic operations, and stores on the Cray X-MP pipelines.]

(c) Execution of the CVFs using pipeline chaining on the Cray-1:

[Figure: corresponding space-time diagram for the Cray-1.]

The speedup of the Cray X-MP over the Cray-1, obtained from the ratio of the two execution times, approaches 1.25 for large n.
Problem 8.12

(a) The average execution rate can be computed as

$$R_a = \frac{1}{\alpha/R_v + (1-\alpha)/R_s}.$$

Substituting the given values of R_s and R_v yields

$$R_a = \frac{10}{10 - 9\alpha}\ \text{(Mflops)}.$$

(b) The plot is shown below:

[Plot: R_a versus the vectorization ratio alpha; the rate rises slowly for small alpha and sharply as alpha approaches 1.]

(c) We have

$$\frac{10}{10 - 9\alpha} = 7.5,$$

hence alpha = 26/27 = 0.963.

(d) With the given data, the following equation is obtained:

$$\frac{R_v}{0.3 R_v + 0.7} = 2,$$

which can be solved to give R_v = 3.5 Mflops.
Problem 8.13

(a) The algorithm to compute the expression on a serial computer is shown below:

s = A_1 x B_1
For i = 2 to 32 Do
    s = s + A_i x B_i
Enddo

There are 32 multiply operations and 31 add operations. The number of time units needed is 32 x 4 + 31 x 2 = 190.

(b) The algorithm for the SIMD computer is shown below (the 32 elements are viewed as a 4 x 8 array A_{i,j}, and indices wrap around modulo 8):

Parfor j = 1 to 8 Do
    s(j) = A_{1,j} x B_{1,j}            /* 1 multiply operation */
    For i = 2 to 4 Do
        s(j) = s(j) + A_{i,j} x B_{i,j}  /* 1 multiply and 1 add operation */
    Enddo
    s(j) = s(j) + s(j+1)                /* 1 routing and 1 add operation */
    s(j) = s(j) + s(j+2)                /* 2 routing and 1 add operations */
    s(j) = s(j) + s(j+4)                /* 4 routing and 1 add operations */
Enddo

There are 4 multiply operations, 6 add operations, and 7 routing operations per PE. The time needed is 4 x 4 + 6 x 2 + 7 x 1 = 35 cycles.
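The cycle count can be verified by executing the SIMD algorithm step by step. The sketch below is an illustration only: eight simulated PEs each form four local products and then combine partial sums with routings of 1, 2, and 4 positions, charging the same costs as above (multiply 4, add 2, routing 1 time unit per position).

def simd_inner_product(A, B, n_pe=8):
    """Simulate the 8-PE SIMD evaluation of sum(A[i]*B[i]) for 32 elements."""
    assert len(A) == len(B) == 4 * n_pe
    cycles = 0
    # local phase: PE j accumulates elements j, j+8, j+16, j+24
    s = [A[j] * B[j] for j in range(n_pe)]
    cycles += 4                                    # 1 multiply
    for i in range(1, 4):
        for j in range(n_pe):
            s[j] += A[i * n_pe + j] * B[i * n_pe + j]
        cycles += 4 + 2                            # 1 multiply + 1 add per step
    # combining phase: shift by 1, 2, 4 (wraparound) and add
    for shift in (1, 2, 4):
        s = [s[j] + s[(j + shift) % n_pe] for j in range(n_pe)]
        cycles += shift + 2                        # routing cycles + 1 add
    return s[0], cycles

if __name__ == "__main__":
    A = list(range(1, 33))
    B = [2] * 32
    result, cycles = simd_inner_product(A, B)
    print("result :", result, "(expected", sum(a * b for a, b in zip(A, B)), ")")
    print("cycles :", cycles)                      # 35, versus 190 on the serial machine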
Problem 8.14

(a) A Cray Y-MP C-90 has 16 processors. Each processor has 2 vector pipelines. Each pipeline has a floating-point multiply unit and an add unit which can operate concurrently. Therefore, two floating-point operations can be performed each cycle in a vector pipeline. The total number of operations performed in a cycle is 16 x 2 x 2 = 64. The machine has a cycle time of 4.2 ns. Hence, the peak performance = (64 floating-point operations) / (4.2 ns) = 15.2 Gflops.

(b) An NEC SX-X has 4 processors. Each processor has 4 sets of vector pipelines. Each set has two add/shift and two multiply/logical pipelines. The total number of operations performed in a cycle is 4 x 4 x 2 x 2 = 64. Its cycle time is 2.9 ns. Thus the peak performance = (64 floating-point operations) / (2.9 ns) = 22 Gflops.

(c) Both machines perform 64 floating-point operations per cycle, as explained above.
Problem 8.15

(a) Matrices A and B are both divided into blocks, each of size 8 x 8. Denote the blocks as A_{ij} and B_{ij}, respectively, for 0 <= i, j <= 7. Cannon's algorithm for matrix multiplication is used in this problem. The following tables show the initial distribution of matrices A and B among the PEs. The submatrix blocks are stored in a skewed manner: the diagonal subblocks of A appear in the first column, and those of B appear in the first row.

A00 A01 A02 A03 A04 A05 A06 A07
A11 A12 A13 A14 A15 A16 A17 A10
A22 A23 A24 A25 A26 A27 A20 A21
A33 A34 A35 A36 A37 A30 A31 A32
A44 A45 A46 A47 A40 A41 A42 A43
A55 A56 A57 A50 A51 A52 A53 A54
A66 A67 A60 A61 A62 A63 A64 A65
A77 A70 A71 A72 A73 A74 A75 A76

B00 B11 B22 B33 B44 B55 B66 B77
B10 B21 B32 B43 B54 B65 B76 B07
B20 B31 B42 B53 B64 B75 B06 B17
B30 B41 B52 B63 B74 B05 B16 B27
B40 B51 B62 B73 B04 B15 B26 B37
B50 B61 B72 B03 B14 B25 B36 B47
B60 B71 B02 B13 B24 B35 B46 B57
B70 B01 B12 B23 B34 B45 B56 B67

Blocks of C are stored in the natural order in the PEs, as shown below:

C00 C01 C02 C03 C04 C05 C06 C07
C10 C11 C12 C13 C14 C15 C16 C17
C20 C21 C22 C23 C24 C25 C26 C27
C30 C31 C32 C33 C34 C35 C36 C37
C40 C41 C42 C43 C44 C45 C46 C47
C50 C51 C52 C53 C54 C55 C56 C57
C60 C61 C62 C63 C64 C65 C66 C67
C70 C71 C72 C73 C74 C75 C76 C77
(b) The overall algorithm is specified as follows for each PE:

For i = 0 to 7 Do
    Compute the product of the block submatrices of A and B residing in the PE
        and add the product to the resident block of matrix C.
    Pass the block submatrix of A to the left neighbor in a wraparound
        fashion using shift operations.
    Pass the block submatrix of B to the upper neighbor in a wraparound
        fashion using shift operations.
Enddo

Basically, in step 1 of the ith iteration, PE_{xy} performs the following computation:

$$C_{xy} = C_{xy} + A_{x,(j+i)\bmod 8}\; B_{(j+i)\bmod 8,\,y},$$

where j is the initial column index of the block submatrix of A residing in PE_{xy}. It is straightforward to specify the detailed operations for the multiplication of two submatrix blocks in each PE.

Steps 2 and 3 exchange matrix elements among the PEs. In the last iteration, they bring the individual submatrices back to the PEs in which they initially resided. Note that all the PEs perform identical operations on different data, in keeping with the SIMD mode of operation.
(c) The multiplication of 8 x 8 matrix blocks in each PE and the accumulation into C take 8^3 multiplication and 8^3 addition operations. Steps 2 and 3 require 8^2 shift operations each. Therefore, the number of cycles needed in each iteration is 2 x 8^3 + 2 x 8^2 = 1152. The total number of cycles in 8 iterations is 9216. If the shift operations of the last iteration are omitted, 128 cycles can be saved.

(d) If data duplication is allowed, each block submatrix of A is duplicated along its row and each block submatrix of B is duplicated along its column by the following instructions:

For i = 0 to 7 Do
    PEs in column i broadcast their submatrices of A
        to the other PEs in the same row.
    PEs in row i broadcast their submatrices of B
        to the other PEs in the same column.
Enddo

Now each PE has all the elements needed to compute a subblock of the C matrix, and no further data movement is required. So the last step is for all PEs to compute the submatrix blocks of C simultaneously. The arithmetic operations are identical to those in (b). Possible savings in execution time come from the reduction in communication overhead, if the broadcast operations can be carried out efficiently.
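The skewed data movement of parts (a) and (b) is easy to check on small matrices. The sketch below is an illustration only: it uses one element per PE (rather than the 8 x 8 blocks of the problem) and plain Python lists, but the initial skewing and the left/up wraparound shifts mirror the algorithm above.

def cannon_matmul(A, B):
    """Cannon's algorithm on an n x n grid, one element per 'PE'."""
    n = len(A)
    # initial alignment: skew row i of A left by i, column j of B up by j
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]      # local multiply-accumulate
        a = [row[1:] + row[:1] for row in a]      # shift A one step left (wraparound)
        b = b[1:] + b[:1]                         # shift B one step up (wraparound)
    return C

if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    C = cannon_matmul(A, B)
    expected = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]
    print(C == expected, C)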
Problem 8.16 The comparison of the CM-2 and CM-5 is summarized in the table below; a more detailed comparison can be found in the relevant manuals.

Machine | Architecture                         | Operation Mode           | Potential Performance | Improvement
CM-2    | 64K bit-slice processors, hypercube  | SIMD                     | 10 Gflops             |
CM-5    | 16K SPARCs, 4-ary fat tree           | SIMD, MSIMD, Sync. MIMD  | 2 Tflops              | mixture of parallel techniques
Problem 8.17 The linear combination can be written as

$$(y_0, y_1, \ldots, y_{1023}) = a_0\mathbf{x}_0 + a_1\mathbf{x}_1 + \cdots + a_{1023}\mathbf{x}_{1023}$$
$$= a_0(x_{0,0}, x_{1,0}, \ldots, x_{1023,0}) + a_1(x_{0,1}, x_{1,1}, \ldots, x_{1023,1}) + \cdots + a_{1023}(x_{0,1023}, x_{1,1023}, \ldots, x_{1023,1023})$$
$$= (a_0 x_{0,0} + a_1 x_{0,1} + \cdots + a_{1023} x_{0,1023},\ \ldots,\ a_0 x_{1023,0} + a_1 x_{1023,1} + \cdots + a_{1023} x_{1023,1023}).$$

Thus, we have the following equalities:

$$y_i = \sum_{j=0}^{1023} a_j x_{i,j}, \qquad i = 0, 1, \ldots, 1023. \tag{8.1}$$
(a) From Eq. (8.1) above, we see that each element of y can be computed separately. Thus, each processor can be used to carry out one-fourth of the computations: processor l (l = 0, 1, 2, 3) computes elements l x 256 through (l + 1) x 256 - 1 of y. Vector a is replicated in all processors. The multiplier and adder in each processor are chained as shown in the following diagram:

[Figure: the multiply pipeline chained into the add pipeline, with the adder output fed back to one adder input through the 4-element buffer C.]

Without loss of generality, consider the operations performed by processor 0. In each processor, two auxiliary vectors are used: C is a vector of 4 elements which are initialized to 0, and D(0 : 1023) is used to store intermediate results. Let us examine the computation of y_0. The computations are divided into two phases.

In cycle 0, a_0 and x_{0,0} are fed into the multiplier. After 4 cycles, their product appears at one input of the adder. After four more cycles, the value a_0 x_{0,0} appears at the output and is routed back to the input port for C (see the diagram). Thereafter, one more product term is added to C_0 every four cycles. A similar situation holds for the other elements of C. For a description of pipeline chaining for this purpose, see [Hwang84], pp. 279-280.

After all the product terms for y_0 have been accumulated in the adder, the elements of vector C have the following values:

$$C_k = \sum_{j=0}^{255} a_{4j+k}\, x_{0,4j+k}, \qquad k = 0, 1, 2, 3.$$

Just prior to the arrival of the product terms for the next element of y at the adder, C(0 : 3) are stored into the corresponding segment D(4i : 4i + 3) one by one and are reset to zero.

This process is repeated for successive elements of y. In this way, pairs of elements of a and x can be fed continuously into the multiplier in each cycle. Thus, at the end of the first phase, D holds 256 "segments" of 4 elements each. This phase takes 256 x 1024 + 8 - 1 = 262,151 clock cycles.

In the second phase, each segment in D is summed to obtain one element of y. This can be done by first generating 256 pairs of partial sums (two elements in each segment are added), and then adding each pair of partial sums to produce the final result. In the optimal case, the first four add operations can be overlapped with the last four add operations of the first phase. Therefore, the total number of cycles needed for phase 2 is 512 + 256 = 768. Consequently, the total number of cycles for the multivector system is 262,151 + 768 = 262,919.

Note that if the two phases of computation are interspersed, the vector D is not needed, but the timing is not optimal.

(b) On a single processor without vector processing capability, the number of operations is 1024 x 1024 multiplications and 1024 x 1023 additions. Each operation takes 4 cycles, giving a total of 8,384,512 cycles. Therefore, the speedup of the multivector system over the single scalar processor is 8,384,512 / 262,919 = 31.89, which is close to the theoretical maximum value of 32.

In the above analysis, pipeline startup time has been neglected and a very intelligent scheduler is assumed. Actual performance may be poorer when various overheads are taken into account.
Problem 8.18 Suppose low-order interleaving is used so that consecutive elements of a vector are stored in consecutive memory modules. Without loss of generality, assume that the first element (element 0) of the vector is stored in memory module 0. Let s be the stride of a vector access, and let n_1 = m_1 s and n_2 = m_2 s be the indices of two different elements retrieved; assume n_1 > n_2. The memory modules in which the two elements reside are n_1 mod 17 and n_2 mod 17, respectively. Now

$$(n_1 \bmod 17) - (n_2 \bmod 17) \equiv (n_1 - n_2) \bmod 17 = (s m_1 - s m_2) \bmod 17$$
$$= \begin{cases} 0, & \text{if } (m_1 - m_2) \bmod 17 = 0 \\ \big((s \bmod 17)\,((m_1 - m_2) \bmod 17)\big) \bmod 17 \neq 0, & \text{if } (m_1 - m_2) \bmod 17 \neq 0 \end{cases}$$

The second result follows from the given condition s mod 17 != 0. Therefore, if (m_1 - m_2) mod 17 != 0 for every pair m_1 and m_2, there will be no conflicts in the memory accesses. Normally, the elements are accessed in increasing order and at most 17 elements are accessed at a time, so (m_1 - m_2) mod 17 != 0, which ensures conflict-free accesses.
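The argument can be checked numerically. The short sketch below is an illustration only: it lists the modules touched by 17 consecutive accesses of a given stride and reports whether any module repeats; every stride that is not a multiple of 17 is conflict-free.

def modules_touched(stride, count=17, modules=17):
    """Module indices hit by `count` consecutive vector elements of the given stride."""
    return [(k * stride) % modules for k in range(count)]

def conflict_free(stride, count=17, modules=17):
    hit = modules_touched(stride, count, modules)
    return len(set(hit)) == len(hit)           # no module visited twice

if __name__ == "__main__":
    for s in (1, 2, 5, 16, 17, 34, 35):
        print(f"stride {s:2d}: conflict-free = {conflict_free(s)}")
    # Only strides that are multiples of 17 (17, 34) cause repeated modules.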
Chapter 9

Scalable, Multithreaded, and Dataflow Architectures
Problem 9.1

(a) The efficiency can be computed as

$$E = \frac{1/R}{1/R + L} = \frac{1}{1 + RL}.$$

(b) The new rate of remote memory requests is R' = (1 - h)R. Hence,

$$E = \frac{1}{1 + R'L} = \frac{1}{1 + (1-h)RL}.$$

(c) The number of threads needed to saturate the processor is

$$N_d = \frac{1/R' + C + L}{1/R' + C} = 1 + \frac{L}{1/R' + C}.$$

If N >= N_d, the latency is completely hidden and

$$E_N = \frac{1/R'}{1/R' + C} = \frac{1}{1 + (1-h)CR}.$$

If N < N_d,

$$E_N = \frac{N/R'}{1/R' + C + L} = \frac{N}{1 + (1-h)R(L + C)}.$$
(d) To compute L, we need the mean internode distance D. Let P_i be the probability that a node sends a message to a node at distance i. In reference to Problem 2.11, D can be computed by summing i P_i over all possible distances; for the network considered here the result is D = r. The remote-access latency L in the expressions of part (c) is then replaced by the latency L' implied by this mean distance, giving

$$E_N = \frac{1/R'}{1/R' + C} = \frac{1}{1 + (1-h)CR} \qquad \text{for } N \ge N_d,$$

$$E_N = \frac{N}{1 + (1-h)R(L' + C)} \qquad \text{for } N < N_d.$$
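For quick experimentation, the saturation model of part (c) can be evaluated numerically. The sketch below is an illustration only, assuming the expressions given above; the parameter values are arbitrary illustrative choices, not data from the problem.

def efficiency(N, R, L, C, h=0.0):
    """Multithreaded processor efficiency with N resident threads.
    R : rate of remote requests per cycle (1/R is the mean run length)
    L : remote access latency, C : context switch overhead, h : hit ratio."""
    Rp = (1.0 - h) * R                     # effective remote-request rate
    run = 1.0 / Rp                         # mean run length between remote requests
    N_d = (run + C + L) / (run + C)        # threads needed to saturate the processor
    if N >= N_d:
        return run / (run + C)             # latency fully hidden
    return N * run / (run + C + L)         # linear region

if __name__ == "__main__":
    R, L, C, h = 0.02, 100, 10, 0.5        # illustrative values only
    for N in range(1, 6):
        print(N, round(efficiency(N, R, L, C, h), 3))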
Problem 9.2 The architectural assumptions and notation used in this problem are similar to those in [Saavedra90]. A deterministic model is adopted in the analysis. Summarized below are the basic system parameters to be used:

• N: the number of threads or contexts that can be supported simultaneously in each processor.
• C: the context-switching overhead, which accounts for the cycles lost in performing a context switch in a processor.
• L: the communication latency for a processor to access a remote memory through the network.
• R: the run length of a single thread before it issues a memory request or is switched out. Note that the definition here is the inverse of the definition of R in Problem 9.1.
• f: the coverage factor for prefetching, defined as the percentage of memory requests successfully prefetched to satisfy the demand of a thread.
• E: the processor efficiency, defined as the percentage of time a processor is actively executing a thread.

(a) Effectively, prefetching reduces the memory latency from L to L' (L' < L). If a memory request has been prefetched, the time spent on the request equals V + L', where V is the overhead for prefetching, which includes the effect of the extra instructions inserted to perform prefetching. (Assume a software prefetching technique is used.) The processor efficiency E of a single-threaded processor with prefetching can be expressed as

$$E = \frac{R}{R + f(V + L') + (1 - f)L}. \tag{9.1}$$

The latency for a remote access is thus reduced from L to f(V + L') + (1 - f)L.

(b) Based on the reduced latency, two different regions in the efficiency curve can be identified:

$$E = \frac{R}{R + C}, \qquad \text{if } N \ge N_d = \frac{f(V + L') + (1 - f)L}{R + C} + 1,$$

$$E = \frac{NR}{R + C + f(V + L') + (1 - f)L}, \qquad \text{if } N < N_d.$$
Problem 9.3 For this problem, the same parameters as defined in Problem 9.2 are used.

(a) The major benefit of release consistency lies in allowing read requests to bypass outstanding write requests and allowing write requests to be pipelined. Therefore, the processor stalls only for read requests or when the write buffer is full. The probability of a write buffer being full is usually very low if the write buffer is large enough in capacity.

Let w be the probability of a request being a write. The processor efficiency with release consistency alone can be expressed as

    E = R / (R + (1 - w)L + w·b),        (9.2)

where b is a parameter that depends on the buffer capacity, the network delay, w, and the rate at which remote memory accesses are requested by each processor.

(b) With the release consistency model, the number of threads needed to completely hide the latency is

    N_d = [(1 - w)L + w·b] / (R + C) + 1.

Thus, the efficiency is

    E_sat = R / (R + C),                     if N ≥ N_d;
    E_lin = N·R / [(1 - w)L + w·b + R + C],  if N < N_d.

(c) If prefetching and release consistency are both employed, the latency will be further reduced. Combining the results in Eqs. 9.1 and 9.2 above, we obtain the following expression for the effective latency in a single-threaded processor:

    L_eff = f(V + L') + (1 - f)[(1 - w)L + w·b].

Thus, the efficiency of a single-threaded processor is

    E = R / (R + L_eff).

Based on L_eff, we can determine the number of threads needed to fully hide the latency in a multithreaded processor as

    N_d = {f(V + L') + (1 - f)[(1 - w)L + w·b]} / (R + C) + 1.

Hence, the efficiency of a multithreaded processor is

    E_sat = R / (R + C),                                           if N ≥ N_d;
    E_lin = N·R / {f(V + L') + (1 - f)[(1 - w)L + w·b] + R + C},   if N < N_d.
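The closed forms above can be folded into a single routine. The following C sketch follows the formulas of parts (b) and (c); all numeric values in main() are made up for illustration, and setting f = 0 recovers the release-consistency-only case of part (b):

    #include <stdio.h>

    /* Efficiency with software prefetching (coverage f, overhead V,
       reduced latency Lp) combined with release consistency (write
       fraction w, write-buffer parameter b); R is the run length,
       C the context-switch overhead, and N the number of threads.    */
    double efficiency(double N, double R, double C, double L,
                      double f, double V, double Lp, double w, double b) {
        double Leff = f * (V + Lp) + (1.0 - f) * ((1.0 - w) * L + w * b);
        double Nd   = Leff / (R + C) + 1.0;   /* threads needed to hide Leff */
        if (N >= Nd)
            return R / (R + C);               /* saturated region            */
        return N * R / (Leff + R + C);        /* linear region               */
    }

    int main(void) {
        double R = 20, C = 2, L = 120, f = 0.6, V = 4, Lp = 10, w = 0.3, b = 8;
        for (int N = 1; N <= 6; N++)
            printf("N = %d  E = %.3f\n", N,
                   efficiency(N, R, C, L, f, V, Lp, w, b));
        return 0;
    }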
Problem 9.4

(a) We know m processors are attached to each column bus, since there are m row buses in the system. Each generates r requests per second on the bus. Thus, the total request rate per bus is mr. Suppose each request consists of t bits, assuming a uniform length for all requests. (Alternatively, t can be taken as the average length of each request.) Then the following relation holds, where a is the bus utilization:

    m·r·t = B·a.

Therefore, the memory bandwidth per bus is

    B = m·r·t / a.

(b) Assume all the buses (row or column) have the same bandwidth. There are 2m buses in the system. Hence, the total bus bandwidth is 2mB = 2m²rt/a.

(c) There are m² processors in the system, each generating r requests per second. If each request uses only a row bus or a column bus, then the total bandwidth requirement is m²rt. This has to be satisfied by the available memory bandwidth, which is 2mB. Therefore

    m²·r·t ≤ 2mB.

Hence,

    r ≤ 2B / (m·t).

(d) If all the processors send requests that need to go through two buses (one column bus and one row bus), then at a certain instant of time there would be m²r + m²r = 2m²r requests that need to be serviced by the bus system. Therefore, the total bus bandwidth needed is 2m²rt.

(e) The bus bandwidth of the multicube system is designed to allow a bus utilization rate of at most 1 (i.e., 0 < a ≤ 1). In (d), the bus bandwidth requirement represents the maximum bandwidth demand. From the relation

    2m²·r·t ≤ 2m²·r·t / a,

it is concluded that the available bus bandwidth provided by the multicube is adequate.
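The relations in parts (a) through (e) reduce to a few products, as the short C sketch below shows; the values chosen for m, r, t, and a are arbitrary examples:

    #include <stdio.h>

    int main(void) {
        double m = 16;      /* m x m processors, m row and m column buses */
        double r = 1.0e5;   /* requests per second per processor          */
        double t = 256;     /* bits per request                           */
        double a = 0.8;     /* maximum allowed bus utilization            */

        double B      = m * r * t / a;      /* per-bus bandwidth, part (a)   */
        double total  = 2 * m * B;          /* total bus bandwidth, part (b) */
        double demand = 2 * m * m * r * t;  /* worst-case demand, part (d)   */

        printf("per-bus bandwidth B       = %.3e bits/s\n", B);
        printf("total bus bandwidth 2mB   = %.3e bits/s\n", total);
        printf("worst-case demand 2m^2rt  = %.3e bits/s\n", demand);
        printf("adequate: %s\n", demand <= total ? "yes" : "no");
        return 0;
    }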
Problem 9.5

(a) See Fig. 4 in [Hwang91].

(b) Because of the column and row access modes available on the OMP, special instructions are needed. Please refer to the original paper [Hwang91] for a description of the instructions, the data distribution, and the SPMD program for performing matrix multiplication.
(c) The number of orthogonal memory accesses is 2N³/n² + 2N²/n + N²/n². The number of synchronizations is 2N²/n². For details, see the proof of Lemma 1 in [Hwang91].

(d) Two-dimensional FFT requires 4N²/n² orthogonal memory accesses and one synchronization. For a description of the SPMD program and the complexity analysis, see [Hwang91].
Problem 9.6

(a) The SVM retains the programming paradigm of a tightly coupled shared-memory multiprocessor, which directly supports data sharing among processes. This promotes portability of programs across systems. In addition, it has the advantages of a distributed-memory machine. The large virtual address space allows programs to be much larger in code and data space than the physical memory on an individual node. Moreover, remote memory can be used as an added level of the memory hierarchy between the local memory and the disks to improve performance. Thus, SVM provides such desirable properties as low cost and scalability by getting rid of hardware bottlenecks.

(b) In SVM systems implemented by the OS (such as IVY), it is convenient to use the underlying virtual memory page size as the unit of sharing among processes. In hardware-implemented SVM systems (such as Dash), the unit of sharing is usually much smaller, typically the size of a cache block. Some of the differences are listed below:

• Page-level sharing is more effective for exploiting locality of reference in shared-memory processes. But it is also more susceptible to contention among processes (more than one process trying to access the same page).
• Page-level sharing is more likely to cause false sharing; that is, two processes may access completely different parts of the same page.
• The size of the directory is much larger if the unit of sharing is cache blocks instead of pages; the storage demand of the directory information can be excessive.
Problem 9.7

(a) • To implement RC, one needs two memory instructions (load-lock and store-unlock), a lockup-free cache, and some kind of scoreboarding to keep track of outstanding requests.
• To implement PC, one needs a multiport memory to allow processors to perform out-of-order writes.
• To implement WC, one needs store buffers in each processor with some matching hardware to bypass loads.
(b) Different consistency models impose different constraints on the order of shared-memory accesses by each process. The following diagram, adapted from [Gharachorloo91], illustrates the event ordering according to the PC, WC, and RC models. In the figure, L stands for load, S for store, A for acquire, and R for release; an arrow means program order has to be observed; loads and stores in the same block can be executed in any order provided dependence relations are respected. Subscripts to acquire and release operations stand for synchronization variables or memory locations.

(Figure omitted: permissible orderings of load, store, acquire, and release events under the PC, WC, and RC models.)
Some of the advantages and shortcomings of each model are summarized below. See, for instance, [Gharachorloo91, Mosberger93] for further discussions.

• Advantages of PC: Loads are allowed to overtake store accesses by the same processor if the accesses are to different locations. If a load and a store are to the same memory location, the load can be satisfied by the store operation, as in the TSO or PSO model. Thus, a load never stalls for pending stores.
• Shortcomings of PC: Store operations in each processor have to follow program order, making the chance of a write buffer becoming full higher, which means the processor is more likely to be stalled.
• Advantages of WC: WC ensures sequential consistency only at synchronization points. Load/store operations between synchronization points can be performed in any order as long as control and data dependences are not violated in each processor. Hence, store operations can be pipelined, leading to improved performance.
• Shortcomings of WC: The processor is stalled at an acquire operation, waiting for previous stores and releases to complete. It is also stalled at the first load following a release operation. As a result, in fine-grain computations with frequent synchronizations, WC can perform poorly compared to PC.
• Advantages of RC: The shortcoming of WC for fine-grain computations is eliminated, since RC does not block a processor at a load/acquire waiting for previous store/release operations to complete. Independent synchronizations do not need to wait for the completion of each other, as shown in the diagram. Therefore, a higher degree of parallelism can be realized.
• Shortcomings of RC: RC requires more complex hardware/software support for implementation (see (a)). Special language constructs and compiler support are needed to properly label a program and generate the code for execution in this model.
Problem 9.8

(a)
• It provides the communication, synchronization, and global naming mechanisms required to efficiently support fine-grain, concurrent programming models.
• It extends a conventional microprocessor instruction set architecture with instructions to support parallel processing.
• It provides hardware support for end-to-end message delivery including formatting, injection, delivery, buffer allocation, buffering, and task scheduling.
• It supports a broad range of parallel programming models, including shared-memory, data-parallel, dataflow, actor, and explicit message-passing, by providing low-overhead primitive mechanisms for communication, synchronization, and naming. Its communication mechanisms permit a user-level task on one node to send a message to any other node in a 4096-node machine in less than 2 μs.

(b) All messages route first in the X dimension, then Y, then Z.

(c) The AAU performs all functions associated with memory addressing. It contains the address and ID registers to support naming and relocation. It protects memory accesses and implements the translation instructions. It maintains two queues to buffer incoming messages and schedule the associated tasks.

(d) See Example 9.4.
Problem 9.9

(a) In a VLSI implementation, networks with many dimensions require more and longer wires than low-dimensional networks. Thus, high-dimensional networks cost more and run more slowly than low-dimensional networks. Under the assumption of constant wire bisection, low-dimensional networks have wide channels, and high-dimensional networks have narrow channels. With the wormhole routing method, which is used by most of the second- and third-generation multicomputers, the wider channels provide a lower latency, less contention, and higher hot-spot throughput.

(b) We can treat the router at each node as a stage, and the flit buffer as the stage latch, in a superpipelined functional unit. Information is transmitted (processed) from one router (stage) to another. The differences are:
• Most pipelined functional units are synchronously operated.
• Pipelined functional units have fixed data flow patterns, but the message-passing mechanism may dynamically change its data flow according to the routing information in the header flits.
Problem 9.10

(a) • The memory is initially in the home state (uncached), and all cache copies are invalid. Sharing-list creation begins at the cache where an entry is changed from an invalid to a pending state. When a read-cache transaction is directed from a processor to the memory controller, the memory state is changed from uncached to cached and the requested data is returned. The requester's cache entry state is then changed from the pending state to an only-clean state. Sharing-list creation is illustrated in the figure below. Multiple requests can be simultaneously generated, but they are processed sequentially by the memory controller.

(Figure omitted: before and after the read-cached transaction, the requesting processor's cache entry goes from pending to only-clean and the memory state goes from home/uncached to cached.)
• For a subsequent memory access, the memory state is cached, and the cache at the head of the sharing list has possibly dirty data. As illustrated in the figure below, a new requester (cache A) first directs its read-cache transaction to memory but receives a pointer to cache B instead of the requested data.

(Figure omitted: cache A's read-cache transaction to memory returns a pointer to the old head, cache B.)

A second cache-to-cache transaction, called prepend, is directed from cache A to cache B. Cache B then sets its backward pointer to point to cache A and returns the requested data. The dashed lines correspond to transactions between a processor and memory or another processor. The solid lines are sharing-list pointers. After the transaction, the inserted cache A becomes the new head, and the old head (cache B) follows cache A in the chain.
(b) Compared to a backplane bus, a chained directory provides greater bandwidth and better scalability. Its cost can be lower since snoopy cache controllers are not needed. It allows an invalidation signal to be sent to specific processors instead of broadcasting the signal to all processors. However, it may take a longer time for the signal to reach all the processors involved.

The advantage of a chained directory compared to a full-map directory is the saving in space needed to store directory information. Suppose there are P processors in the system and the number of memory blocks is M. Typically M is proportional to P. If a full-map directory is used, a presence bit is needed to indicate whether a processor has a particular memory block in its cache. The total number of presence bits is O(MP) = O(P²). On the other hand, if a chained directory is used, each block only needs to maintain a pointer to the first processor that caches the block. Each pointer takes O(log P) bits, so a total of O(M log P) = O(P log P) bits is needed. This saving also makes SCI more scalable than a full-map directory.

Compared to a full-map directory, a chained directory has two disadvantages. First, the time it takes to send an invalidation signal to all processors that have a cache copy of a memory block may be longer when the number of processors is large. The reason is that with a full map, the invalidation signal can be sent to all such processors in parallel, whereas with a chained directory, the invalidation is propagated through the chain, which can take a long time. Second, the protocol design may be more complicated. Because of the longer delay, race conditions are more likely to arise, which have to be taken into account in the protocol design.
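The space argument can be made concrete with a short calculation. The C sketch below compares the directory storage of a full-map scheme with that of a chained (SCI-style) directory; the processor count, memory size, and block size are arbitrary example values, and the per-cache forward/backward pointers of SCI are ignored:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double P     = 256;                 /* processors                   */
        double block = 64;                  /* bytes per memory block       */
        double mem   = P * 16.0e6;          /* total memory: 16 MB per node */
        double M     = mem / block;         /* number of memory blocks      */

        double fullmap = M * P;             /* one presence bit/processor   */
        double chained = M * ceil(log2(P)); /* one head pointer per block   */

        printf("memory blocks M     = %.0f\n", M);
        printf("full-map directory  = %.3e bits  (O(MP))\n", fullmap);
        printf("chained directory   = %.3e bits  (O(M log P))\n", chained);
        printf("full-map / chained  = %.1f\n", fullmap / chained);
        return 0;
    }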
Problem 9.11 Different context-switch policies affect the average busy time R.

(a) In the switch-on-cache-miss policy, a context switch takes place only when a memory access with long latency is involved. Thus, context switching makes good use of the idle time. The overhead is the time taken to determine whether a cache hit or miss has occurred. If the switch-on-load scheme is used, the aforementioned overhead is eliminated, but R is likely to be smaller than with switch on cache miss.

Switch on every instruction interleaves instructions from different contexts on a cycle-by-cycle basis, irrespective of whether a load operation is encountered. The independence among successive instructions can hide pipeline dependences, hence improving pipelined execution efficiency. On the other hand, locality may be jeopardized, which results in a lower cache hit ratio. A scheme which interleaves contexts in blocks of instructions improves the locality of references, but the degree of dependence among successive instructions is higher than that in the switch-on-every-instruction scheme. The determination of a suitable block size can be difficult.

(b) Each context-switch scheme has its merits and drawbacks. Thus, more research is needed to determine which one provides the best performance. The choice will depend on other performance parameters as well. For instance, context-switch cost and memory access latency are likely to influence the decision. The behavior of programs should also be taken into account. Both analytical modeling and simulation will be useful in assessing the performance of different schemes.
Problem 9.12 Dash uses a distributed shared-memory architecture which combines the ease of use of shared memory and the scalability of message-passing systems.

(a) Dash uses an invalidation-based cache coherence protocol. See Fig. 7.15 in the text for the cache states and the events causing transitions from one state to another.

(b) A home cluster maintains the directory and physical memory location of a memory address. Each entry in the directory corresponds to a memory block. It has a presence bit for each processor cache. In addition, a state bit indicates whether the block is uncached, shared, or dirty.

A memory access is satisfied by going through the hierarchy of processor cache, local cluster, home cluster, and finally remote clusters. The directory information makes it possible to send invalidation signals to those processors which have a copy of a memory block instead of broadcasting to all processors. It also helps decide when a memory block needs to be written back to main memory.

(c) See Example 9.5 in the text.

(d) See Example 9.5 in the text.
Problem 9.13

(a) The KSR-1 offers a single-level memory, called ALLCACHE. The ALLCACHE design represents the confluence of cache and shared virtual memory concepts that exploit the locality required by scalable distributed computing. Each local cache has a capacity of 32 Mbytes (2^25 bytes). The global virtual address space has 2^40 bytes.

(b) With ALLCACHE, an address becomes a name, and this name automatically migrates throughout the system and is associated with a processor in a cache-like fashion as needed. Copies of a given cell are made by the hardware and sent to other nodes to reduce access time. A processor can prefetch data into a local cache and poststore data for other cells. The hardware is designed to exploit spatial and temporal locality. When a processor writes to an address, all cells are updated and memory coherence is maintained.

(c) Both systems have distributed main memory, scalable interconnection networks, and directory-based coherence schemes. Dash allows pages to be migrated among processors. DDM has a COMA architecture, which replaces the private memory attached to each node by a huge secondary/tertiary cache, called attraction memory. Data blocks can be migrated or duplicated among processors. Processing nodes in both Dash and DDM are clusters of multiple processors. Dash uses a wormhole-routed mesh interconnect, whereas DDM uses a hierarchy of buses. Refer to the papers for more details.
Problem 9.14

(a) Some of the design goals of the Tera architecture are listed below:
• Very high-speed implementations — The architecture should have a short clock period and be scalable to many processors.
• Applicability to a wide spectrum of problems — Programs that do not vectorize well, due to a preponderance of scalar operations or too-frequent conditional branches, should execute efficiently as long as there is sufficient parallelism to keep the processors busy.
• Ease of compiler implementation — The design of the instruction set should simplify the task of the compiler in generating code that can exploit parallelism efficiently.

(b) The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 x 16 x 16 toroidal mesh; i.e., the mesh "wraps around" in all three dimensions. Of the 4096 nodes, 1280 are attached to the resources comprising 256 cache units and 256 I/O processors. The 2816 remaining nodes do not have resources attached but still provide message bandwidth.

To increase node performance, some of the links are omitted. If the three directions are named x, y, and z, then x-links and y-links are missing on alternate z-layers. This reduces the node degree from 6 to 4, or from 7 to 5 counting the resource link. In spite of its missing links, the bandwidth of the network is very large.
Stream Status Word (SSW):
• 32-bit PC (program counter)
• Modes (e.g., rounding, lookahead disable)
• Trap disable mask (e.g., data alignment, overflow)
• Condition codes (last four emitted)
There are no synchronization bits on R0-R31. The target registers (T0-T7) look like SSWs.
(c) Each processor in a Tera computer can execute multiple instruction streams (threads) simultaneously. In the current implementation, as few as 1 or as many as 128 program counters may be active at once. On every tick of the clock, the processor logic selects a thread that is ready to execute and allows it to issue its next instruction. Since instruction interpretation is completely pipelined by the processor, and by the network and memories as well, a new instruction from a different thread may be issued in each tick without interfering with its predecessors.

When an instruction finishes, the thread to which it belongs becomes ready to execute the next instruction. As long as there are enough threads in the processor so that the average instruction latency is filled with instructions from other threads, the processor is fully utilized. Thus, it is only necessary to have enough threads to hide the expected latency (perhaps 70 ticks on average); once the latency is hidden, the processor is running at peak performance and additional threads do not speed the result.

If a thread were not allowed to issue its next instruction until the previous instruction completed, then approximately 70 different threads would be required on each processor to hide the expected latency. The lookahead described later allows threads to issue multiple instructions in parallel, thereby reducing the number of threads needed to achieve peak performance.
(d) Each thread has the following state associated with it:
• one 64-bit stream status word (SSW);
• thirty-two 64-bit general-purpose registers (R0-R31);
• eight 64-bit target registers (T0-T7).

Context switching is so rapid that the processor has no time to swap the processor-resident thread state. Instead, it has 128 of everything, i.e., 128 SSWs, 4096 general-purpose registers, and 1024 target registers. It is appropriate to compare these registers, in both quantity and function, to vector registers or words of cache in other architectures. In all three cases, the objective is to improve locality and avoid reloading data.

Program addresses are 32 bits in length. Each thread's current program counter is located in the lower half of its SSW. The upper half describes various modes (e.g., floating-point rounding, lookahead disable), the trap disable mask (e.g., data alignment, floating overflow), and the four most recently generated condition codes.

Most operations have a TEST variant which emits a condition code, and branch operations can examine any subset of the last four condition codes emitted and branch appropriately. Also associated with each thread are thirty-two 64-bit general-purpose registers. Register R0 is special in that it reads as 0 and output to it is discarded. Otherwise, all general-purpose registers are identical.

The target registers are used as branch targets. The format of the target registers is identical to that of the SSW, though most control transfer operations use only the low 32 bits to determine a new PC. Separating the determination of the branch target address from the decision to branch allows the hardware to prefetch instructions at the branch targets, thus avoiding delay when the branch decision is made. Using target registers also makes branch operations smaller, resulting in tighter loops. There are also skip operations which obviate the need to set targets for short forward branches.

One target register (T0) points to the trap handler, which is nominally an unprivileged program. When a trap occurs, the effect is as if a coroutine call to T0 had been executed. This makes trap handling extremely lightweight and independent of the operating system. Trap handlers can be changed by the user to achieve specific trap capabilities and priorities without loss of efficiency.
(e) The Tera architecture uses a new technique called explicit-dependence lookahead. Each instruction contains a 3-bit lookahead field that explicitly specifies how many instructions from this thread will be issued before encountering an instruction that depends on the current one. Since seven is the maximum possible lookahead value, at most 8 instructions and 24 operations can be concurrently executing from each thread.

A thread is ready to issue a new instruction when all instructions with lookahead values referring to the new instruction have completed. Thus, if each thread maintains a lookahead of seven, then nine threads are needed to hide 72 ticks of latency.
(f) The Tera uses multiple contexts to hide latency. The machine performs a context switch every clock cycle. Both pipeline latency (eight cycles) and memory latency are hidden in the HEP/Tera approach. The major focus is on latency tolerance rather than latency reduction.

With 128 contexts per processor, a large number (2K) of registers must be shared finely between threads. Thread creation must be very cheap (a few clock cycles). Tagged memory and registers with full/empty bits are used for synchronization. As long as there is plenty of parallelism in user programs to hide latency, and plenty of compiler support, the performance is potentially very high.

However, these Tera advantages may be offset by a number of potential drawbacks. The performance can be bad for limited parallelism, as in the case of single-context environments. On the other hand, a large number of contexts (threads) requires lots of registers and other hardware resources, which in turn implies higher cost and complexity. Finally, the limited focus on latency reduction and caching entails a high degree of parallelism and a high memory bandwidth in order to hide latency; both tend to drive up the cost of building the machine.
Problem 9.15

(a) Static dataflow computers do not allow more than one token to reside on the same arc of a dataflow graph. The firing rule for an operator node is that all the input tokens are present and there is no token on the output arc(s). The implementation requires extensive acknowledge signals.

Dynamic dataflow computers allow more than one token to be on the same arc simultaneously. Each token is associated with a tag. When tokens of identical tags are present on all the input arcs of an operator, it is fired.

(b) The root of A_i·x_i² + B_i·x_i + C_i = 0 can be computed as

    x_i = (-B_i + sqrt(B_i² - 4·A_i·C_i)) / (2·A_i).

The dataflow graph is shown in the following diagram for any i. There are 11 nodes, each with two input arcs and one output arc. The output tokens of the nodes are labeled a through i.
(c) The partition of the computations among the PEs is shown in the above diagram. The partition is not unique for achieving a balanced load among processors. Suppose each computation takes one clock cycle. Three of the PEs execute three computations each and the fourth one (PE2) executes two computations. The average latency of one iteration is 3 when the computation reaches the steady state. The schedule is shown in the following table, with the subscript of each output token corresponding to the loop index i.

(Table omitted: cycle-by-cycle assignment of the output tokens, subscripted by loop index, to PE0 through PE3.)
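The graph of part (b) can be read as a chain of two-input operations, one per node. The C sketch below shows one such decomposition of x = (-B + sqrt(B² - 4AC)) / (2A); the grouping of operations and their assignment to PEs in the actual solution follow the diagram above, so the breakdown here is only illustrative:

    #include <stdio.h>
    #include <math.h>

    /* Root of A*x^2 + B*x + C = 0 written as a sequence of two-input
       operations, each of which can be mapped onto one dataflow node. */
    double root(double A, double B, double C) {
        double a = B * B;       /* node: B*B            */
        double b = A * C;       /* node: A*C            */
        double c = 4.0 * b;     /* node: 4*(A*C)        */
        double d = a - c;       /* node: B^2 - 4AC      */
        double e = sqrt(d);     /* node: square root    */
        double f = 0.0 - B;     /* node: -B             */
        double g = f + e;       /* node: -B + sqrt(...) */
        double h = 2.0 * A;     /* node: 2A             */
        return g / h;           /* node: final divide   */
    }

    int main(void) {
        printf("root of x^2 - 3x + 2 = 0: %.3f\n", root(1, -3, 2));  /* 2 */
        return 0;
    }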
Problem 9.16

(a) For each y(i), m multiplications and m - 1 additions need to be performed, giving a total of mn multiplications and (m - 1)n additions.

(b) The computations of individual elements of y are independent of each other. Hence, the computations can be partitioned as follows: y(i) is computed by processor 0 for i = 0..63, by processor 1 for i = 64..127, by processor 2 for i = 128..191, and by processor 3 for i = 192..255.

(c) Using the above partition, each processor will need to have 67 elements of x and all four elements of w for the circular convolution. For instance, processor 0 needs x(0) through x(63) and x(253) through x(255). Note that the extra 3 elements to be fetched into each processor reside in memory modules 29, 30, and 31, respectively. This is a result of how the vector elements are stored and the nature of circular convolution. Therefore, proper interleaving is required to avoid conflicts. This interleaving is facilitated by the assumption of enough registers in each processor so that memory access and arithmetic operations can be performed in separate phases.

The ith elements of x and y are stored in memory module j = i mod 32. Elements of vector w are stored in a similar fashion in memory modules 0 through 3. With this storage scheme, each memory module stores 8 elements of vector x.
Modules 0 through 28 will be accessed 8 times each. Modules 29 through 31 will be accessed 12 times each due to the access contentions described above. To fetch w into all four processors takes 4 cycles. The access of 67 elements of x by each of the four processors takes another 67 cycles.

The computations of y can be carried out concurrently in all four processors, each responsible for 64 elements, resulting in a total of 4 x 64 + 3 x 64 = 448 cycles. Finally, the elements of y are stored back to memory at a rate of 4 elements per cycle, taking 64 cycles. Therefore, the total parallel execution time is

    t4 = 4 + 67 + 448 + 64 = 583 cycles.

(d) If a single processor is used, the following steps are required:
Fetch w from memory in 1 cycle.
Fetch x from memory in 64 cycles.
Compute y in 256 x (4 + 3) = 1792 cycles.
Store y into memory in 64 cycles.

Thus, the execution time on a single processor is

    t1 = 1 + 64 + 1792 + 64 = 1921 cycles.

The speedup using 4 processors over a single processor is

    t1/t4 = 1921/583 = 3.3.
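The partition of part (b) is easy to check against a direct computation. The C sketch below evaluates the circular convolution y(i) = sum of w(k)·x((i-k) mod 256) for k = 0..3, with the iterations grouped into four blocks of 64 to mirror the four-processor partition; it is a functional model only, not a timing model, and the data values are arbitrary:

    #include <stdio.h>

    #define N 256
    #define M 4

    int main(void) {
        double x[N], w[M], y[N];
        for (int i = 0; i < N; i++) x[i] = i;         /* sample data      */
        for (int k = 0; k < M; k++) w[k] = 1.0 / M;

        /* Four blocks of 64 elements, one per (simulated) processor.    */
        for (int p = 0; p < 4; p++) {
            for (int i = p * 64; i < (p + 1) * 64; i++) {
                double s = 0.0;                       /* 4 mults, 3 adds  */
                for (int k = 0; k < M; k++)
                    s += w[k] * x[(i - k + N) % N];
                y[i] = s;
            }
        }
        printf("y[0] = %.2f  y[255] = %.2f\n", y[0], y[255]);
        return 0;
    }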
Problem 9.17

(a) A fine-grain processor typically has a small amount of memory associated with it. In the construction of large-scale computer systems, fine-grain processors match better with fine-grain software parallelism and have a cost advantage over medium-grain processors.

(b) In a uniprocessor system, there is only a single address space. Many programs have been developed based on this concept. A single global address space offers continuity of this perception and can simplify the program development process, as the programmer does not need to worry about the message-passing mechanisms on individual machines. It also simplifies data partitioning and dynamic load balancing and improves the portability of programs across machines with different architectures.

(c) Because of high synchronization cost, coarse-grain parallelism necessitates the allocation of a large chunk of computations, such as several iterations, to each processor. As a result, low-level parallelism such as individual iterations or instructions is not fully exploited. From a scalability point of view, as the number of processors is increased, it is important to take advantage of such low-level parallelism in order to reduce solution time and improve processor utilization. This consideration favors the use of fine-grain parallelism over medium- or coarse-grain parallelism.

Chapter 10

Parallel Models, Languages and
Compilers
Problem 10.1

(a) In synchronous message-passing, the sender and the receiver must be synchronized in time and space. In other words, a communication channel must be established before message passing can commence, much like communication over a telephone line. No buffering is needed on the channel.

In the case of asynchronous message-passing, it is not necessary to coordinate the sender and receiver. A message is delivered to the channel and may be stored in buffers on the channel or in a global mailbox before arriving at the receiver. In this scheme, an acknowledgment from the receiver is needed to signal the correct receipt of a message.

(b) In synchronous message passing, if a channel cannot be established between two communicating processes, the message will be blocked, which in turn will block the execution of the processes involved. On the other hand, in asynchronous message passing, as long as the channel buffer is sufficiently large, the transmission of messages and the execution of processes will not be blocked. Therefore, it offers better resource utilization and potentially shorter communication delays.

(c) Rendezvous is a scheme adopted in Ada for synchronous message passing. In this scheme, a sender or receiver arriving earlier at the rendezvous has to wait for the arrival of the other before they can proceed to exchange messages.

(d) In a name-addressing scheme, a sender or receiver process is identified by the process ID and the node in which it resides. This convention is adopted by Ada. In a channel-addressing scheme, a path is established between a sender and a receiver process by specifying the channels connecting the nodes in which the processes reside.
(e) In asynchronous message passing, the sender and receiver processes are effectively uncoupled from each other via the use of intermediaries such as channel buffers or a global mailbox. Through this uncoupling, both processes can execute more freely, leaving the transmission of messages to be handled by the mechanisms provided by the communication channels.

(f) Both interrupt and lost messages can occur in asynchronous message-passing systems. An interrupt message differs from a regular message in that it has to be handled immediately by the receiver process, even though the receiver may not expect to receive it. After it is serviced, the interrupted process can resume its execution.

Lost messages are those directed to a wrong process or node and eventually lost. It is important to design effective detection and debugging facilities to redirect lost messages to the correct receivers to ensure smooth program execution.
Problem 10.2 The idea is to add fork-join primitives into the code to allow parallel execution of the program. Different concurrent Lisp languages, such as Multilisp, Qlisp, Symmetric Lisp, and Connection Machine Lisp, have different syntax and semantics. A Lisp-like code segment based on the concurrent object-oriented model can be found in [Agha90]. In practice, a Lisp language available on an accessible machine should be used to write a program to carry out the computations. Performance data can then be collected and analyzed.
Problem 10.3

(a) C* is a data parallel language developed by Thinking Machines. It provides high-level constructs for parallel computing on SIMD machines. Quinn and Hatcher described compiling and various optimization techniques to convert a program written in C* to one in C for execution on SPMD or MIMD machines. Four issues were addressed in their paper:
• how to infer message-passing requirements;
• how to support synchronization requirements;
• how to emulate a large number of PEs efficiently on a machine without hardware support for virtual processors;
• how to minimize message-passing cost.

Several methods to deal with these problems were discussed by the authors, including reduction of synchronization and message passing. For instance, in order to reduce message-passing cost, instructions and data can be replicated on all nodes. Also, data exchange can be carried out in blocks instead of bytes to reduce startup overhead. Their experiments with Gaussian elimination on an nCube 3200 showed that a C program generated from translation of C* code with message optimization was comparable in quality to a hand-coded C program.

(b) SIMD mode is synchronous in that all active PEs execute the same operations in a lockstep fashion. It is especially suitable for data parallel computations. In SPMD mode every PE executes the same program in an asynchronous manner. PEs coordinate with each other at synchronization points, but otherwise each PE works at its own pace between those points. Synchronization is achieved by message passing among processors. Asynchronous algorithms executed in SPMD mode are prone to time-dependent errors. In contrast, SIMD execution has simple flow control, and the computation results are deterministic regardless of the number of PEs. However, not all applications are suitable for execution in SIMD mode.

(c) See the optimization described in the paper for Gaussian elimination and conduct similar optimization for FFT after an analysis of the program flow.
Problem 10.4

(a) Multiprogramming refers to the interleaved execution of multiple independent programs on a uniprocessor or multiprocessor system through time sharing. Its use is intended to overlap CPU and I/O operations among programs to improve resource utilization.

(b) Multiprocessing is multiprogramming implemented at the process level on a multiprocessor. If interprocessor communications are handled at the instruction level, the mode of operation is MIMD multiprocessing, which exploits fine-grain parallelism.

(c) Multiprocessing in which interprocessor communication takes place at the program, procedural, or subroutine level is characterized as operating in MPMD mode. In this mode, coarse-grain parallelism is exploited.

(d) When a single program is divided into several interrelated tasks which can be executed concurrently on a multiprocessor, the mode of operation is referred to as multitasking.

(e) Multithreading is a refinement of the multitasking and multiprocessing concepts. A task can create multiple threads which are executed on one or more processors at the same time. Since threads are lightweight processes with minimum state and register information, context switching is much faster than in multiprogramming.

(f) Program partitioning refers to the decomposition of a large program and its data sets into small pieces which can be executed in parallel on multiple processors.
Problem 10.5

(a) 1. A(5,8,1), A(5,9,1), A(5,10,1), A(5,8,2), A(5,9,2), A(5,10,2), A(5,8,3), A(5,9,3), A(5,10,3), A(5,8,4), A(5,9,4), A(5,10,4), A(5,8,5), A(5,9,5), A(5,10,5).
2. B(3,5), B(3,6), B(3,7), B(3,8), B(6,5), B(6,6), B(6,7), B(6,8), B(9,5), B(9,6), B(9,7), B(9,8).
3. C(1,3,4), C(2,3,4), C(3,3,4).

(b) 1. Yes. The number of elements is the same in each dimension of the source and destination arrays.
2. No, because the two arrays have different sizes in the first dimension.
3. No, because the two arrays have different dimensions.
4. Yes.
Problem 10.6

(a) Flow dependence between statements S1 and S2 in successive iterations of the loop. The distance vector is (0,1), and the direction vector is (=, <).

(b) Flow dependence between statements S1 and S2. The distance vector is (0,0), and the direction vector is (=, =).

(c) Antidependence between statements S1 and S3 in successive i-loop iterations. The distance vector is (-1,0), and the direction vector is (>, =).
Problem 10.7

(a) S1 → S2 (flow dependence on A in the same iteration); S3 → S2 (antidependence on C carried across iterations).

(b) The vectorized code is as follows:

    A(1:N) = B(1:N)
    E(1:N) = C(2:N+1)
    C(1:N) = A(1:N) + B(1:N)

Note that it is necessary to store the original value of C in E before C is overwritten. Therefore, the order of statements S2 and S3 in the original loop is reversed in the vector code. It is also permissible to interchange the first two vector statements, since they are independent.
Problem 10.8

(a) 1. There is a flow dependence on variable A between statements S1 and S3 in successive iterations of the J-loop. The distance vector is (0,1), and the direction vector is (=, <).
2. There is a flow dependence on variable E between statements S2 and S4 in successive iterations of the J-loop. The distance vector is (0,1), and the direction vector is (=, <).
3. There is an antidependence on variable C between statements S1 and S2 in the same iteration. The distance vector is (0,0), and the direction vector is (=, =).
(b) There is no data dependence among different I-loop iterations. Therefore, they can be executed in parallel. The compiler can preschedule the iterations of the I-loop onto P processors in contiguous blocks as follows:
processor 1 executes iterations 1, 2, ..., [N/P];
processor 2 executes iterations [N/P] + 1, [N/P] + 2, ..., 2[N/P];
...
Alternatively, every Pth iteration can be assigned to the same processor:
processor 1 executes iterations 1, P + 1, 2P + 1, ...;
processor 2 executes iterations 2, P + 2, 2P + 2, ...;
...
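The two prescheduling policies correspond to simple index mappings. The C sketch below prints which iterations each of P processors would execute under block and cyclic assignment; the iteration count and processor count are arbitrary examples:

    #include <stdio.h>

    #define NITER 10
    #define P     3

    int main(void) {
        int chunk = (NITER + P - 1) / P;      /* ceiling of N/P             */

        /* Block (contiguous) assignment.                                  */
        for (int p = 0; p < P; p++) {
            printf("block  processor %d:", p + 1);
            for (int i = p * chunk + 1; i <= (p + 1) * chunk && i <= NITER; i++)
                printf(" %d", i);
            printf("\n");
        }
        /* Cyclic assignment: every Pth iteration to the same processor.  */
        for (int p = 0; p < P; p++) {
            printf("cyclic processor %d:", p + 1);
            for (int i = p + 1; i <= NITER; i += P)
                printf(" %d", i);
            printf("\n");
        }
        return 0;
    }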
Problem 10.9

(a) The loop can be compiled with the I-loop in vector mode, which will generate stride-1 memory operations.

(b) The loop can be compiled for parallelization in the J-loop as follows:

    Doacross J = 1, N
        S1:  A(1:N, J+1) = B(1:N, J) + C(1:N, J)
             signal(J)
             if (J > 1) wait(J-1)
        S2:  D(1:N, J) = A(1:N, J) / 2
    Endacross
The parallel execution is illustrated in the following diagram for the case of two processors.

(Diagram omitted: processor 1 executes the odd iterations and processor 2 the even iterations; in each iteration, S2 is delayed by the wait until the signal from S1 of the preceding iteration has been received.)
Loop permutation maps the iteration vector (p1, ..., pi, pi+1, ..., pn) to (p1, ..., pi+1, pi, ..., pn); loop reversal maps (p1, ..., pi, ..., pn) to (p1, ..., -pi, ..., pn).
The above three transformations can be formulated as elementary matrix operations. See the text for matrix representations and examples.

(d) Loop tiling refers to various techniques of breaking iterations into small blocks to obtain coarser granularity, which can reduce synchronization overhead and improve data locality. Typically an n-deep loop is converted into a 2n-deep loop, where the inner n loops are determined by the tile size used.
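As a concrete instance of the 2n-deep structure, the C sketch below tiles a simple 2-deep loop nest with tile size T; the array and the loop body are placeholders chosen only to make the example self-contained:

    #include <stdio.h>

    #define N 64
    #define T 16                               /* tile size */

    double a[N][N];

    int main(void) {
        /* Original 2-deep nest:
             for (i = 0; i < N; i++)
               for (j = 0; j < N; j++)
                 a[i][j] = i + j;
           Tiled 4-deep nest: the two outer loops step from tile to tile,
           and the two inner loops stay within one T x T tile, improving
           data locality and giving a natural unit for synchronization.  */
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++)
                        a[i][j] = i + j;

        printf("a[%d][%d] = %.0f\n", N - 1, N - 1, a[N - 1][N - 1]);
        return 0;
    }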
(e) Wavefront transformation is a technique to maximize the degree of parallelism in n fully permutable loops with dependences. The idea is to skew the innermost loop in the nest with respect to each of the other loops and then move the innermost loop to the outermost position. See the text for examples.

(f) Locality optimization is used to reduce memory access penalties. The idea is to improve the reuse of data once it is brought into a level of the memory hierarchy which is closer to the processor. Techniques such as loop interchange, instruction and data prefetching, and tiling can be used to achieve this goal.

(g) Software pipelining is the pipelining of successive iterations of a loop in a source program. It is particularly suited to deep hardware pipelines and can be used with either Doall or Doacross loops. As in hardware pipelining, it is desirable to minimize the instruction initiation latency.
Problem 10.11

(a) In iteration I, A(I) is updated by the value of A(I+1). The value of A(I+1) is not updated until the (I+1)st iteration, which has not been executed yet. In general, with a forward LCD, the reference to an element always occurs before its value is updated in a later iteration. This type of operation can be vectorized. In effect, the computations in the loop add a scalar constant 3.14159 to each element of A and then shift the elements forward by one position. In other words, the loop is equivalent to the following vector code:

    V(1:N) = A(2:N+1) + 3.14159
    A(1:N) = V(1:N)

(b) The assignment to A(2) in the second iteration depends on the value assigned to B(2) in the first iteration. The compiler can interchange the statements within the loop so that the assignment to B occurs before the assignment to A, as shown below:

    Do I = 1, N-1
        B(I+1) = D(I) + 3.14159
        A(I) = B(I) + C(I)
    Enddo

The code can then be vectorized as follows:

    B(2:N) = D(1:N-1) + 3.14159
    A(1:N-1) = B(1:N-1) + C(1:N-1)
Problem 10.12

(a) This program can be vectorized as follows:

    A(1:N) = TEMP(1:N) + 3.14159
(b) The code cannot be directly vectorized or parallelized because of the carry-around variables S and X. To see this, consider the following parallel code:

    Doall I = 1, N
        If (A(I) .LE. 0.0) then
            S = S + B(I) * C(I)
            X = B(I)
        Endif
    Enddo

If all processors are allowed to proceed concurrently, the values of S and X will be nondeterministic. In contrast, the serial code gives a definite answer for S and X.

However, if an intermediate vector is introduced to store the values of B(I) * C(I), then some vector or parallel processing can be achieved. This is illustrated in the following code for performing the conditional inner-product operation:

    D(1:N) = 0
    where (A(1:N) .LE. 0.0) do
        D(1:N) = B(1:N) * C(1:N)
    endwhere

See [Wolfe89] for more details. The elements of D can then be summed up in parallel using a binary tree computing structure to obtain S. Alternatively, S can be obtained by a vector reduction operation as follows:

    S = sum(D(1:N))

Similarly, the determination of X in the original loop can be vectorized. Let vector P be initialized so that P(I) = I for I = 1..N, and let Q be a zero vector. The following vector code yields the desired result:

    where (A(1:N) .LE. 0.0) do
        Q(1:N) = P(1:N)
    endwhere
    K = max(Q(1:N))
    X = B(K)

In the above, max is a vector reduction function which finds the maximum value of a vector. Of course, the performance of the vector code will depend on how fast vectors P and Q can be generated. Typically, P and the initial Q are created at compile time since their elements are fixed. Then the cost can be amortized over a large number of executions of the vector code.
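The effect of the where/sum/max construction can be mirrored in plain C for checking. The sketch below builds the masked vectors D and Q and then applies the two reductions; the array contents are made-up test data:

    #include <stdio.h>

    #define N 8

    int main(void) {
        double A[N] = {-1, 2, -3, 4, -5, 6, -7, 8};
        double B[N] = { 1, 1,  2, 1,  3, 1,  4, 1};
        double C[N] = { 1, 1,  1, 1,  1, 1,  1, 1};
        double D[N];
        int    Q[N];

        /* Masked products and masked indices (the two where-blocks).   */
        for (int i = 0; i < N; i++) {
            D[i] = (A[i] <= 0.0) ? B[i] * C[i] : 0.0;
            Q[i] = (A[i] <= 0.0) ? i + 1       : 0;   /* 1-based index  */
        }
        /* Reductions: S = sum(D), K = max(Q), X = B(K).                */
        double S = 0.0;
        int    K = 0;
        for (int i = 0; i < N; i++) { S += D[i]; if (Q[i] > K) K = Q[i]; }
        double X = B[K - 1];

        printf("S = %.1f  X = %.1f\n", S, X);   /* expect S = 10, X = 4 */
        return 0;
    }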
Problem 10.13 Tanenbaum et al. proposed a layered approach to provide a uniform interface for parallel programming. The approach is insensitive to machine architecture and can be used with multiprocessors or multicomputers. Besides architecture transparency, the other goal is to maintain good performance in a distributed shared memory system. The two major components of the system are shared objects and reliable broadcasting. An object is an abstract data type with well-defined operations. For instance, an object can be a data structure with read and write operations.

An object that is shared by multiple processes is replicated for each process that needs to access it. When a process performs a read operation on a shared object, it is treated as an operation on a private object and can be done locally with proper synchronization. When a write operation is performed on a shared object, the updated value needs to be sent to other processes via the reliable broadcasting mechanism.

In general, read operations occur much more frequently than write operations. Therefore, replicating and sharing data objects can be effective. Moreover, the low overhead associated with reliable broadcasting (at most 2 sends for each message) allows the system to scale up in performance. Consult [Tanenbaum92] for more details about the broadcasting protocols and object management schemes.

Chapter 11
Parallel Program Development
and Environments
Problem 11.1

(a) In busy wait, a process waiting for an event remains loaded in the context registers of a processor and keeps trying to get into the critical section. In sleep wait, a waiting process is removed from the processor and put in a wait queue. Later on, after the event it is waiting for takes place, the suspended process is awakened and rescheduled.

(b) In sleep wait, a policy is needed to select one of the suspended processes in the wait queue to be revived. The policy must ensure that all suspended processes in the queue are treated fairly. That is, no process should be suspended for an extraordinary amount of time compared to others. For instance, a first-come-first-served revival policy is a fair policy.

(c) A lock is a mechanism used to implement presynchronization, in which a requester process is required to obtain sole access to an atom (a shared writable object) before performing an operation to update it. The purpose is to avoid concurrent updates to an object.

(d) Optimistic synchronization (or postsynchronization) allows an atom to be updated before sole access is granted to a requester process. This is achieved in two steps. First, the requester modifies a local version of the object. Second, it checks to see if there has been a concurrent update to the global version. If so, the local update is aborted; otherwise, the global version is updated.

(e) In server synchronization, each atom is associated with an update server. Any process that wishes to perform an atomic operation on an atom has to do so by sending a request to the server. This approach is often adopted in object-oriented systems to provide data encapsulation. The corresponding synchronization environment is often more user friendly, as the user does not need to know or worry about the implementation details of the mutual exclusion mechanisms. This strategy is adopted in monitors for synchronization and can be implemented with server daemons.
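Optimistic synchronization as described in (d) is commonly realized with a compare-and-swap retry loop. A minimal C11 sketch follows; the shared counter and the update applied to it are illustrative stand-ins for the local and global versions of an atom:

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_long shared = 0;              /* global version of the atom     */

    /* Compute an updated local value, then commit it only if no
       concurrent update has occurred; otherwise abort and retry.         */
    void optimistic_add(long delta) {
        long old, new;
        do {
            old = atomic_load(&shared);  /* snapshot the global version    */
            new = old + delta;           /* update the local version       */
        } while (!atomic_compare_exchange_weak(&shared, &old, new));
    }

    int main(void) {
        optimistic_add(5);
        optimistic_add(7);
        printf("shared = %ld\n", atomic_load(&shared));   /* prints 12 */
        return 0;
    }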
Problem 11.2

(a) A lock is a mechanism used to ensure sole access to a critical section. If a spin lock is used, a process waiting to enter the critical section will keep on trying until it gains access. In the case of a suspend lock, once a process is denied access to the critical section, it is suspended and put in a queue. Suspended processes are activated one by one when access to the critical section is allowed. A suspend lock allows more efficient use of the processor than a spin lock, but care must be taken to guard against indefinite waiting by some processes.
(b) Dekker's algorithm for synchronization ensures mutual exclusion and avoids unnecessary waiting. To accomplish this, each process uses a flag to indicate whether it desires to enter the critical section. To achieve mutual exclusion, each process checks whether there is another process in the critical section; if so, it backs off. The following algorithm is described in [Silberschatz88]. It uses an array flag(0 : n-1) to indicate the status of the processes. Each element of the array can assume three values: idle, in, and out. A global variable turn is used to select a process between 0 and n - 1. Initially, all the elements of the flag array are set to idle, and turn can assume any valid value. An auxiliary integer variable j is also used in the algorithm. Each process i, 0 <= i < n, announces its intention by setting flag(i), waits until it is selected by turn, and on leaving the critical section passes the turn to the next waiting process:

    critical section
    flag(i) = idle;
    j = (i + 1) mod n;
    while (j != i && flag(j) != in) j = (j + 1) mod n;
    turn = j;
    exit critical section

Note that initially it is possible for several processes to set their flags to in at the same time. If that happens, all of these processes will be forced to reset their flags to out. On the second try, only one of them will be able to enter and set its flag to in; the others will be blocked and will spin wait. When an incumbent process exits the critical section, it selects the next process to enter the critical section in an orderly manner. This guarantees that any process wishing to enter the critical section will be able to do so after at most n - 2 tries.
(c) The generalized Dekker's algorithm can be implemented using Test&Set. Each process is associated with a flag, which can be examined and/or changed by all the processes. In addition, each process has a local variable key, which can only be updated by the owning process. A global variable lock is used to guard the entrance to a critical section. Initially, all the flags are set to false. Each process i wishing to enter the critical section executes the following code, also adapted from [Silberschatz88]:

    flag(i) = true;
    key = true;
    while (flag(i) && key) key = Test&Set(lock);
    flag(i) = false;

    critical section

    j = (i + 1) mod n;
    while (j != i && flag(j) == false) j = (j + 1) mod n;
    if (j == i) lock = false;
    else flag(j) = false;

This code uses the atomic Test&Set operation to ensure mutual exclusion of the critical section. The method used to select the next process is similar to the software algorithm in (b).
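For comparison, the same bounded-waiting lock can be written with C11 atomics, where the Test&Set role is played by atomic_exchange. This is a sketch under the assumption of NPROC cooperating threads, each knowing its own index i; it mirrors the pseudocode above rather than any particular library interface:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NPROC 8

    static atomic_bool lock;                    /* guards the critical section */
    static atomic_bool waiting[NPROC];          /* flag(i) of the pseudocode   */

    void enter_cs(int i) {
        atomic_store(&waiting[i], true);
        bool key = true;
        while (atomic_load(&waiting[i]) && key)
            key = atomic_exchange(&lock, true); /* Test&Set                    */
        atomic_store(&waiting[i], false);
    }

    void exit_cs(int i) {
        int j = (i + 1) % NPROC;                /* look for a waiting process  */
        while (j != i && !atomic_load(&waiting[j]))
            j = (j + 1) % NPROC;
        if (j == i)
            atomic_store(&lock, false);         /* nobody waiting: free lock   */
        else
            atomic_store(&waiting[j], false);   /* hand the lock over directly */
    }

    int main(void) {                            /* trivial single-thread check */
        enter_cs(0);
        printf("in critical section\n");
        exit_cs(0);
        return 0;
    }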
Problem 11.3

(a) A binary semaphore is a variable which can assume the value 1 or 0. It has two associated operations, P and V, corresponding to wait and signal. A process wishing to enter a critical section first performs a P operation to see if another process is already in the critical section. If that is the case, it is blocked. When a process leaves the critical section, it performs a V operation, thus allowing a waiting process to be awakened and to enter the critical section. A binary semaphore is initialized to 1 to allow the first process to enter the critical section without waiting. It can be implemented in hardware using atomic operations such as Test&Set.

(b) A monitor is a high-level construct that encapsulates shared variables and associated procedures into a module. A monitor consists of (1) local variables, (2) procedures that manipulate local variables, global variables, and parameters passed from calling processes, and (3) initialization of local variables. Only the values of local variables can be changed by the procedures. Also, only one process is allowed to be in the monitor at a time; thus the mutual exclusion mechanism is embedded in the construct. A monitor relieves individual processes of the need to take care of mutual exclusion in the code and reduces the possibility of errors. For instance, in the use of a binary semaphore, if the semaphore is not initialized to 1, processes that wish to enter the critical section will hang up indefinitely. With the use of monitors, the debugging process is simplified by getting rid of such inadvertent mistakes.
Problem 11.4 Let the philosophers and forks both be numbered 0 to 4. The fork to the right of philosopher i is fork i and the one to his left is fork (i - 1) mod 5.

(a) Let forks(0:4) be the semaphores associated with the forks; all its elements are initialized to 1 at the beginning.

In the fetch protocol, an even-numbered philosopher first picks up the fork to his right and then the one to his left. An odd-numbered philosopher first picks up the fork to his left and then the one to his right. In the release protocol, both forks are put down in any order.

Fetch protocol

    if (i mod 2 == 0) then
        P(forks(i));
        P(forks((i-1) mod 5));
    else
        P(forks((i-1) mod 5));
        P(forks(i));
    endif

Release protocol

    V(forks((i-1) mod 5));
    V(forks(i));

This protocol allows a philosopher to hold one fork while waiting for the other. Deadlocks are avoided by breaking the circular wait among the philosophers, which is a necessary condition for deadlock to occur. Based on the protocol, at least one philosopher will be able to eat at any moment. Moreover, a philosopher will pick up the first fork as soon as it becomes available instead of waiting until both forks on his sides are available. This prevents a conspiracy between two philosophers to starve a third philosopher seated between them. Therefore, starvation is also avoided.
(b) The above fetch and release protocols can be implemented using a monitor as follows:

    Monitor dining-philosophers
        forks(0:4): condition;
        procedure fetch(i)
        begin
            if (i mod 2 == 0) then
                wait(forks(i mod 5));
                wait(forks((i-1) mod 5));
            else
                wait(forks((i-1) mod 5));
                wait(forks(i mod 5));
            endif
        end
        procedure release(i)
        begin
            signal(forks((i-1) mod 5));
            signal(forks(i mod 5));
        end
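The same asymmetric fetch order can also be expressed directly with Pthreads mutexes standing in for the fork semaphores. The sketch below is illustrative only: the eating and thinking phases are reduced to a print statement and each philosopher runs a fixed number of rounds:

    #include <pthread.h>
    #include <stdio.h>

    #define NPHIL  5
    #define ROUNDS 3

    static pthread_mutex_t forks[NPHIL];

    static void *philosopher(void *arg) {
        int i     = (int)(long)arg;
        int right = i;                        /* fork i is on the right       */
        int left  = (i + NPHIL - 1) % NPHIL;  /* fork (i-1) mod 5 on the left */

        for (int r = 0; r < ROUNDS; r++) {
            if (i % 2 == 0) {                 /* even: right fork first       */
                pthread_mutex_lock(&forks[right]);
                pthread_mutex_lock(&forks[left]);
            } else {                          /* odd: left fork first         */
                pthread_mutex_lock(&forks[left]);
                pthread_mutex_lock(&forks[right]);
            }
            printf("philosopher %d eats (round %d)\n", i, r);
            pthread_mutex_unlock(&forks[left]);
            pthread_mutex_unlock(&forks[right]);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NPHIL];
        for (int i = 0; i < NPHIL; i++) pthread_mutex_init(&forks[i], NULL);
        for (int i = 0; i < NPHIL; i++)
            pthread_create(&t[i], NULL, philosopher, (void *)(long)i);
        for (int i = 0; i < NPHIL; i++) pthread_join(t[i], NULL);
        return 0;
    }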
Problem 11.5 A set of processes is in a state of deadlock when every process in the set is waiting for resources held by another process in the set. According to the definition, we know that four conditions — hold and wait, no preemption, mutual exclusion, and circular waiting — must hold at the same time to cause a deadlock. If any of the conditions is false, then deadlock can be prevented. For example, if a resource is sharable by more than one process or can be preempted, then there is no need to wait for the resource. Circular waiting is implied in the definition of deadlock. Finally, if a process does not hold resources while waiting for others, these resources can be used by other processes, thereby breaking the stalemate situation.

When all four conditions hold simultaneously, a deadlock situation will potentially occur. But a deadlock can often be averted by properly revising the resource allocation diagram to eliminate circular waiting.

Deadlock prevention refers to the use of suitable protocols to ensure that at least one of the four necessary conditions for deadlock will not hold, and thus the occurrence of deadlock is prevented.

Deadlock avoidance refers to the management of resources so that situations that may lead to deadlock are avoided. Usually it is achieved by dynamically keeping track of the resources available, allocated, and requested. The operating system closely monitors the usage of resources to avoid deadlocks.

Deadlock detection is a systematic approach for detecting whether a deadlock situation is present. When no deadlock prevention or avoidance measure is employed, deadlocks may occur and need to be detected so that a deadlock recovery algorithm can be invoked.

When a deadlock is detected, a deadlock recovery strategy is used to break it. Two options are often adopted. One is to kill one or more of the deadlocked processes to remove the circular waiting. The other is to preempt some of the resources held by one or more of the deadlocked processes.
Problem 11.6

(a)
• B and D do not cause deadlock, because only after B releases S1 can D claim S1. If A is executed before B, there will be no deadlock between A and D, either. But if B is executed before A, then A and D can enter a deadlock, with D holding S2 and S3 while waiting for S1, and A holding S1 while waiting for S2.
• C and E can be in deadlock. After C gets S2 (P(S2)) and E gets S3 (P(S3)), C claims S3 and E claims S2, which can never be satisfied.
(b) If C and E are deadlocked, A, B, and D will be blocked indefinitely. If A and D
are deadlocked, C and E will be blocked indefinitely.
(c) It depends on race conditions. For instance, if C (or E) can secure both S2 and
S3 before E (or C), it will have all the resources it needs. After it finishes execution,
both resources are released so that E (or C) can proceed. Thus, deadlock is avoided.
Similarly, whether A and D deadlock also depends on a race condition.
(d) The deadlock between C and E can be prevented by either of the following two
changes, which alter the order of resource acquisition in C or E:

    in C: P(S3); P(S2); ...    or
    in E: P(S2); P(S3); ...

The resulting resource allocation graphs contain no circular wait between C and E.
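The second option can be made concrete with a small C fragment in which both C and E acquire the semaphores in the same order. POSIX semaphores stand in for the P and V operations, and the function bodies are placeholders rather than the processes defined in the problem.

    /* Sketch of the second option: both C and E acquire S2 before S3, so no
     * circular wait can form.  The semaphores are assumed initialized to 1
     * elsewhere. */
    #include <semaphore.h>

    sem_t S2, S3;

    void process_C(void)
    {
        sem_wait(&S2);            /* P(S2) */
        sem_wait(&S3);            /* P(S3) */
        /* ... use both resources ... */
        sem_post(&S3);            /* V(S3) */
        sem_post(&S2);            /* V(S2) */
    }

    void process_E(void)
    {
        sem_wait(&S2);            /* P(S2): same order as C, so whichever   */
        sem_wait(&S3);            /* process obtains S2 first can always finish */
        /* ... use both resources ... */
        sem_post(&S3);
        sem_post(&S2);
    }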
Problem 11.7
(a) Suppose on the disk there are n cylinders numbered 0 through n - 1, starting from
the innermost one. An "elevator" algorithm is used in the scheduling. The idea
is to continue sweeping in the inbound or outbound direction until all requests in that
direction have been serviced. Then the sweeping direction is reversed. For details,
see [Bic88].
When a request is made and the disk head is busy, the request is put in
one of two queues: one (insweep) corresponds to inward movement and the other
(outsweep) to outward movement of the disk head. The queued requests are served
according to the position of the destination cylinder.
The scheduler can be implemented by a monitor with conditional wait. See
[Silberschatz88]. If a request is put in the outsweep queue, the distance between
the destination and innermost cylinders (dest) is stored with the request. If a
request is put in the insweep queue, the distance between the destination and
outermost cylinders (n - dest) is stored. The requests are then serviced in the
order determined by this number: the smaller the number, the earlier a request is
serviced. Clearly, the motivation for the policy is to reduce the movement of the
disk head. The following monitor implementation is adapted from [Bic88].
Monitor disk-scheduler
    type direction = (in, out);
    dest, pos: integer;
    dir: direction;
    busy: boolean;
    incount, outcount: integer;
    insweep, outsweep: condition;
    procedure request(dest);
    begin
        if busy
            if (pos < dest) || (pos == dest && dir == out)
                outcount = outcount + 1;
                outsweep.wait(dest);
            else
                incount = incount + 1;
                insweep.wait(n - dest);
            endif
        endif
        busy = true;
        pos = dest;
    end
    procedure release;
    begin
        busy = false;
        if dir == out
            if outcount > 0
                outcount = outcount - 1;
                outsweep.signal;
            else
                dir = in;
                if incount > 0
                    incount = incount - 1;
                    insweep.signal;
                endif
            endif
        else
            if incount > 0
                incount = incount - 1;
                insweep.signal;
            else
                dir = out;
                if outcount > 0
                    outcount = outcount - 1;
                    outsweep.signal;
                endif
            endif
        endif
    end
    begin
        dir = in; pos = n - 1; busy = false;
        incount = 0; outcount = 0;
    end
In the above program, the syntax of the wait instruction is slightly changed to
accommodate the priority parameter.
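The decision made at the top of request, and the priority key under which a request waits, can be restated as a small C helper. This is an illustrative restatement only; the names sweep_t and classify_request do not appear in [Bic88] or in the monitor above.

    /* Illustrative C helper mirroring the queueing decision in request():
     * choose the sweep queue and the priority key under which the request
     * waits.  Smaller keys are serviced earlier. */
    typedef enum { IN, OUT } sweep_t;

    typedef struct {
        sweep_t queue;   /* which condition queue the request joins      */
        int     key;     /* priority: smaller key means serviced earlier */
    } enqueue_choice;

    /* n    : number of cylinders (0 .. n-1, innermost first)
     * pos  : current head position;  dir: current sweep direction
     * dest : requested cylinder                                          */
    enqueue_choice classify_request(int n, int pos, sweep_t dir, int dest)
    {
        enqueue_choice c;
        if (pos < dest || (pos == dest && dir == OUT)) {
            c.queue = OUT;
            c.key   = dest;      /* outward sweep serves increasing cylinders */
        } else {
            c.queue = IN;
            c.key   = n - dest;  /* inward sweep serves decreasing cylinders  */
        }
        return c;
    }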
(b) A user process can access data on the disk by the following sequence:

    request(cylnum);
    call driver procedure to transfer the data;
    release;

The cylnum argument indicates the location on the disk where the requested data
resides. It can be generated by the file server from user-specified information.
Problem 11.8 A monitor for a barrier counter can be specified as follows:

Monitor barrier-counter
    var counter: integer;
    flag: condition;
    procedure block(n)
    begin
        counter = counter + 1;
        if (counter == n) then
            begin
                for (i = 1; i < n; i++) flag.signal;
                counter = 0;
            end
        else
            flag.wait;
        endif
    end
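For comparison, the same barrier behaviour can be sketched in C with a Pthreads mutex and condition variable. The sketch is illustrative; the names barrier_t, barrier_init, and barrier_block are chosen here, and POSIX itself also offers pthread_barrier_t for this purpose.

    /* Illustrative C sketch of the barrier monitor above. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  flag;
        int counter;     /* processes that have arrived in this round     */
        int n;           /* number of processes in the barrier            */
        int round;       /* distinguishes successive uses of the barrier  */
    } barrier_t;

    void barrier_init(barrier_t *b, int n)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->flag, NULL);
        b->counter = 0;
        b->n = n;
        b->round = 0;
    }

    void barrier_block(barrier_t *b)
    {
        pthread_mutex_lock(&b->lock);
        int my_round = b->round;
        if (++b->counter == b->n) {          /* last arrival: release everyone */
            b->counter = 0;
            b->round++;
            pthread_cond_broadcast(&b->flag);
        } else {
            while (b->round == my_round)     /* guard against spurious wakeups */
                pthread_cond_wait(&b->flag, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }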
(b) POSIX (Portable Operating System Interface for UNIX) is an attempt to standardize
operating systems so that applications conforming to the POSIX standard
are portable from one platform to another. IEEE began defining POSIX standards
in 1985. With FIPS 151-1, POSIX was declared by the National Institute of Standards and
Technology to be the standard interface for government open systems in 1990.
Many vendors have subsequently come up with operating systems that comply
with POSIX. OSF/1 compliance with POSIX includes shells, real-time computing,
security facilities, transparent file-access support, protocol-independent interprocess
communication, etc.
(c) The program development environment contains a set of tools, including editors,
compilers, linker, and debugger, based on packages developed by the Free Software
Foundation. Major UNIX shells are also supported. The OSF/1 environment supports
application program construction through a layered approach, with applications
on top of user libraries and system libraries, which in turn are supported by the OS
kernel. Shared libraries reduce space requirements, improve performance, and
lower development and debugging cost. Separate compilation and dynamic linking
help modular development of application programs. Position-independent code
placement also improves performance, among other benefits.
Problem 12.12
(a) A Pthread is a thread as defined in the POSIX standard. Each thread has a single
sequential line of control and is intended to carry out a small, self-contained job.
Motivations for the use of Pthreads are enumerated below:

  • Use of Pthreads enables cross-development of multiprocessor programs on
    a uniprocessor system or on different platforms.
  • A server task can spawn several threads to serve multiple requests. Doing
    so improves resource utilization with light overhead. While one thread is
    blocked, others can be running. On a system with multiple processors,
    the requests can be serviced concurrently.
  • Independent threads can be executing in different states. Multiple threads
    allow computation, communication, and I/O activities to be overlapped.
  • Multiple threads allow asynchronous events to be handled more efficiently
    by preventing inconvenient interrupts and avoiding complex flow control.
(b) The database may be shared among several programs. A user program wishing to
retrieve or update the database sends a request to the server through a communication
channel. The server then spawns a thread to serve the request. Since there
might be several threads trying to access the database, proper synchronization is
needed to prevent simultaneous updates to the data. This is provided by a global
lock db_mutex which ensures that operations on the database are performed in a
critical section. The lock can be envisioned as a semaphore, initialized to 1 at the
beginning. Then pthread_mutex_lock and pthread_mutex_unlock can be viewed as
P and V operations, respectively. See Chapter 11 for more details.
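A minimal sketch of this structure in C is given below. The helper names handle_request, update_database, and serve_forever are illustrative placeholders; only the use of a global db_mutex with pthread_mutex_lock/pthread_mutex_unlock follows the discussion above.

    /* Sketch of the server structure in (b): each incoming request is handled
     * by a new Pthread, and db_mutex serializes access to the shared database. */
    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t db_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void update_database(const int *req) { (void)req; /* ... */ }

    static void *handle_request(void *arg)
    {
        int *req = (int *)arg;

        pthread_mutex_lock(&db_mutex);      /* P(db_mutex): enter critical section */
        update_database(req);
        pthread_mutex_unlock(&db_mutex);    /* V(db_mutex): leave critical section */

        free(req);
        return NULL;
    }

    /* Server loop: spawn one detached thread per request. */
    void serve_forever(int (*next_request)(void))
    {
        for (;;) {
            int *req = malloc(sizeof *req);
            *req = next_request();          /* e.g., read from a communication channel */
            pthread_t tid;
            pthread_create(&tid, NULL, handle_request, req);
            pthread_detach(tid);
        }
    }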
Problem 12.13

(a) LINPACK is a package developed by Jack Dongarra, Jim Bunch, Cleve Moler,
and Pete Stewart for solving linear equations and linear least squares problems.
See [Dongarra79] for a more detailed description of the package and its usage.
LINPACK has been widely used as a benchmark to determine the performance of
various computer systems. See [Dongarra92].
It can deal with linear systems whose matrices are general, banded, symmetric
indefinite, symmetric positive definite, triangular, or tridiagonal square. It uses
Gaussian elimination with pivoting, and Cholesky factorization for symmetric
positive definite matrices, to decompose a matrix. In addition, the package computes
the QR (by Householder transformations) and singular value decompositions of
rectangular matrices and applies them to least squares problems. For a description of the
pertinent algorithms, please consult texts on numerical analysis or matrix algebra
such as [Dahlquist74] and [Golub89].
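To illustrate the kind of factorization LINPACK performs for a general matrix, a compact (and deliberately unoptimized) C sketch of Gaussian elimination with partial pivoting is given below. It is not LINPACK code; the function name lu_factor and the row-major, in-place storage convention are assumptions made for this example.

    /* Gaussian elimination with partial pivoting on an n x n matrix stored
     * row-major in a[].  Overwrites a[] with the L and U factors and records
     * the row interchanges in piv[].  Returns 0 on success, or k+1 if a zero
     * pivot is met at step k. */
    #include <math.h>

    int lu_factor(int n, double *a, int *piv)
    {
        for (int k = 0; k < n; k++) {
            /* choose the largest remaining entry in column k as the pivot */
            int p = k;
            for (int i = k + 1; i < n; i++)
                if (fabs(a[i * n + k]) > fabs(a[p * n + k])) p = i;
            piv[k] = p;
            if (a[p * n + k] == 0.0) return k + 1;   /* singular to working precision */

            if (p != k)                               /* swap rows k and p */
                for (int j = 0; j < n; j++) {
                    double t = a[k * n + j];
                    a[k * n + j] = a[p * n + j];
                    a[p * n + j] = t;
                }

            /* eliminate below the pivot; multipliers are stored in place */
            for (int i = k + 1; i < n; i++) {
                double m = a[i * n + k] / a[k * n + k];
                a[i * n + k] = m;
                for (int j = k + 1; j < n; j++)
                    a[i * n + j] -= m * a[k * n + j];
            }
        }
        return 0;
    }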
(b) Most machines provide vectorization and/or concurrentization support based on
extensive dependence analysis. Other optimization techniques such as loop interchange
may also be implemented. They also allow user interaction to optionally
enable or disable such optimizations. Please check the manuals of machines accessible
locally.
(c) Parallel I/O is important to performance when the data set is large. Without
efficient support, I/O can become a bottleneck and degrade overall performance.
Parallel I/O is also desirable to support real-time monitoring of program activities
for performance tuning or debugging purposes. The OS should also support effective
program partitioning, scheduling, and synchronization for the parallel execution of
LINPACK programs.
Problem 12.14 The degree of compiler support provided by different machines may
vary widely. For instance, some systems use very primitive processors which may not be
able to perform vector operations, while other systems may have sophisticated processors
capable of efficient vector processing. Concurrency support is typically provided through
a library of system calls, which manages message passing and other activities. System
calls can be linked with user programs at compilation/linkage time. Parallel I/O support
is essential so that code and data can be quickly distributed to individual nodes and
the results be sent back to the host. Dynamic load balancing provided by the OS will
be valuable to the efficient utilization of system resources, especially when the matrix
is not regularly structured. Check relevant manuals for more detailed information.
Problem 12.15

(a) If the conservative policy is used, at most 20/4 = 5 processes can be active simultaneously.
Since one of the drives allocated to each process can be idle most of the
time, at most 5 drives will be idle at a time. In the best case, none of the drives
will be idle.
(b) To improve drive utilization, each process can be allocated three tape drives.
The fourth one will be allocated on demand. Under this policy, at most ⌊20/3⌋ = 6
processes can be active simultaneously. The minimum number of idle drives is 0
and the maximum is 2.

Bibliography
[Adam74] T. L. Adam, K. M. Chandy, and J. R. Dickson, "A Comparison of List
Schedules for Parallel Processing Systems", Commun. ACM, 17(12):685-690, 1974.
[Agha90] G. Agha, “Concurrent Object-Oriented Programming”, Commun. ACM,
33(9):125-141, Sept. 1990.
[Archibald86] J. Archibald and J. L. Baer, "Cache Coherence Protocols: Evaluation
Using a Multiprocessor Simulation Model", ACM Trans. Computer Systems,
4(4):273-298, Nov. 1986.
[Berntsen90] J. Berntsen, "Communication-Efficient Matrix Multiplication on Hypercubes",
Parallel Computing, pp. 335-342, 1990.
[Bic88] L. Bic and A. C. Shaw, The Logical Design of Operating Systems, 2nd ed.,
Prentice-Hall, Englewood Cliffs, NJ, 1988.
[Cannon69] L. E. Cannon, A Cellular Computer to Implement the Kalman Filter
Algorithm, Ph.D. thesis, Montana State University, 1969.
[Caswell90] D. Caswell and D. Black, "Implementing a Mach Debugger for Multithreaded
Applications", Proc. Winter 1990 USENIX Conf., Washington, DC, Jan. 1990.
[Chan86] T. F. Chan and Y. Saad, "Multigrid Algorithms on the Hypercube Multiprocessor",
IEEE Trans. Computers, 35(11):969-977, 1986.
[Dahlquist74] G. Dahlquist and Å. Björck, Numerical Methods, Prentice-Hall, Englewood
Cliffs, NJ, 1974.
[Dongarra79] J. J. Dongarra et al., LINPACK Users' Guide, SIAM, Philadelphia, 1979.
[Dongarra92] J. Dongarra, "Performance of Various Computers Using Standard Linear
Equations Software", Technical report, Computer Science Department, University
of Tennessee, Knoxville, TN, 1992.
[Dubois88] M. Dubois, C. Scheurich, and F. A. Briggs, "Synchronization, Coherence
and Event Ordering in Multiprocessors", IEEE Computer, 21(2), 1988.
[Fox87] G. C. Fox, S. W. Otto, and A. J. Hey, "Matrix Algorithms on Hypercube (I):
Matrix Multiplication", Parallel Computing, pp. 17-31, 1987.
[Furtney92] M. Furtney, "Parallel Processing at Cray Research, Inc.", in R. H. Perrott
(ed.), Software for Parallel Computers, pp. 133-154, Chapman & Hall, 1992.
[Gharachorloo91] K. Gharachorloo, A. Gupta, and J. Hennessy, "Performance Evaluation
of Memory Consistency Models for Shared-Memory Multiprocessors", Proc.
Fourth Int. Conf. Arch. Support for Prog. Lang. and OS, 1991.
[Golub89] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., The
Johns Hopkins University Press, 1989.
[Hossfeld89] F. Hossfeld, R. Knecht, and W. E. Nagel, "Multitasking: Experience with
Applications on a Cray X-MP", Parallel Computing, 12:259-283, 1989.
[Hwang84] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing,
McGraw-Hill, New York, 1984.
[Hwang91] K. Hwang and C. M. Cheng, "Simulated Performance of a RISC-Based
Multiprocessor Using Orthogonal Access Memory", J. Parallel Distrib. Computing,
13:43-57, 1991.
[JaJa92] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, Reading,
MA, 1992.
[Johnsson89] S. L. Johnsson and C. T. Ho, "Optimal Broadcasting and Personalized
Communication in Hypercubes", IEEE Trans. Computers, 38(9):1249-1268, Sept.
1989.
[Konicek91] J. Konicek et al., "The Organization of the Cedar System", Proc. Int.
Conf. Parallel Processing, volume 1, pp. 49-56, 1991.
[Leighton92] F. T. Leighton, Introduction to Parallel Algorithms and Architectures,
Morgan Kaufmann, 1992.
[Li89] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems",
ACM Trans. Computer Systems, pp. 321-359, Nov. 1989.
[Mosberger93] D. Mosberger, "Memory Consistency Models", Operating Systems Review,
27(1):18-26, Jan. 1993.
[Quinn87] M. J. Quinn, Designing Efficient Algorithms for Parallel Computers,
McGraw-Hill, New York, 1987.
[Saavedra90] R. H. Saavedra and D. E. Culler, "Analysis of Multithreaded Architectures
for Parallel Computing", Proc. ACM Symp. Parallel Algorithms and Architectures,
Greece, July 1990.
[Silberschatz88] A. Silberschatz and J. Peterson, Operating System Concepts, Alternate
Edition, Addison-Wesley, Reading, MA, 1988.
[Stone90] H. S. Stone, High-Performance Computer Architecture, Addison-Wesley,
Reading, MA, 1990.
[Tanenbaum92] A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal, "Parallel Programming
Using Shared Objects and Broadcasting", IEEE Computer, 25(8):10-20, 1992.
[Wang89] J. Wang et al., "On the Communication Structures of Hyper-Ring and Hypercube
Multicomputers", J. Computer Sci. Tech., 4(1), Jan. 1989.
[Wolfe89] M. J. Wolfe, “Automatic Vectorization, Data Dependence, and Optimizations
for Parallel Computers”, in Hwang and DeGroot (eds.), Parallel Processing for
Supercomputing and Artificial Intelligence, Chapter 11, McGraw-Hill, New York,
1989.
[Yang89] Q. Yang, L. N. Bhuyan, and B. Liu, "Analysis and Comparison of Cache Co-
herence Protocols for a Packet-Switched Multiprocessor”, IEEE Trans. Computers,
38(8):1143-1153, Aug. 1989.
[Young87] M. W. Young, A. Tevanian, R. F. Rashid, D. B. Golub, J. Eppinger, J. Chew,
W. Bolosky, D. L. Black, and R. Baron, "The Duality of Memory and Communication
in the Implementation of a Multiprocessor Operating System", Proc. 11th
ACM Symp. Operating System Principles, pp. 63-76, 1987.