SOLUTIONS MANUAL TO ACCOMPANY HWANG

ADVANCED
COMPUTER
ARCHITECTURE
PARALLELISM
SCALABILITY
PROGRAMMABILITY

HWANG-CHENG WANG
University of Southern California

JUNG-GEN WU
National Taiwan Normal University

McGraw-Hill, Inc.
New York  St. Louis  San Francisco  Auckland  Bogota
Caracas  Lisbon  London  Madrid  Mexico City  Milan  Montreal
New Delhi  San Juan  Singapore  Sydney  Tokyo  Toronto

Solutions Manual to Accompany Hwang
ADVANCED COMPUTER ARCHITECTURE
Parallelism, Scalability, Programmability
Copyright © 1999 by McGraw-Hill, Inc. All rights reserved.
Printed in the United States of America. The contents, or
parts thereof, may be reproduced for use with
ADVANCED COMPUTER ARCHITECTURE
Parallelism, Scalability, Programmability
by Kai Hwang
provided such reproductions bear copyright notice, but may not
be reproduced in any form for any other purpose without
permission of the publisher.
ISBN 0-07-091623-6
2 3 4 5 6 7 8 9 0  HAM HAM  9 8 7 6 5

Contents

Foreword
Preface
Chapter 1    Parallel Computer Models
Chapter 2    Program and Network Properties
Chapter 3    Principles of Scalable Performance
Chapter 4    Processors and Memory Hierarchy
Chapter 5    Bus, Cache, and Shared Memory
Chapter 6    Pipelining and Superscalar Techniques
Chapter 7    Multiprocessors and Multicomputers
Chapter 8    Multivector and SIMD Computers
Chapter 9    Scalable, Multithreaded, and Dataflow Architectures
Chapter 10   Parallel Models, Languages, and Compilers
Chapter 11   Parallel Program Development and Environments
Chapter 12   UNIX, Mach, and OSF/1 for Parallel Computers
Bibliography

Foreword
Dr. Hwang-Cheng Wang and Dr. Jung-Gen Wu have produced this Solutions
Manual in a timely manner. I believe it will benefit many instructors using Advanced
Computer Architecture: Parallelism, Scalability, Programmability (ISBN 0-07-031622-8)
as a required textbook.

Drs. Wang and Wu have provided solutions to all the problems in the text. Some
of the solutions are unique and have been carefully worked out. Others contain just a
sketch of the underlying principles or computations involved. For such problems, they
have provided references which should help instructors find more information in
relevant sources.

The authors have done an excellent job in putting together the solutions. How-
ever, as with any scholarly work, there is always room for improvement. Therefore,
instructors are encouraged to communicate with us regarding possible refinements to the
solutions. Comments or errata can be sent to Kai Hwang at the University of South-
ern California. They will be incorporated in future printings of this Solutions Manual.
Sample test questions and solutions will also be included in the future to make it more
comprehensive.

Finally, I want to thank Dr. Wang and Dr. Wu and congratulate them on a difficult
job well done within such a short time period.

Kai Hwang
Preface
This Solutions Manual is intended for the exclusive use of instructors. Repro-
duction without permission is prohibited by copyright law.

The solutions in this Manual roughly fall into three categories:

  • For problem-solving questions, detailed solutions have been provided. In some
    cases alternative solutions are also discussed. More complete answers can be
    found in the text for definition-type questions.

  • For research-oriented questions, a summary of the ideas in key papers is pre-
    sented. Instructors are urged to consult the original and more recent publications
    in the literature.

  • For questions that require computer programming, algorithms or basic compu-
    tation steps are specified where appropriate. Example programs can often be
    obtained from on-line archives or libraries available at many research sites.

Equations and figures in the solutions are numbered separately from those in the
text. When an equation or a figure in the text is referenced, it is clearly indicated. Code
segments have been written in assembly and high-level languages. Most of the code should
be self-explanatory. Comments have been added in some places to help explain
the function performed by each instruction.

We have made a tremendous effort to ensure the correctness of the answers. But a
few errors might have gone undetected, and some factors might have been overlooked in
our analysis. Moreover, several questions are likely to have more than one valid solution;
solutions for research-oriented problems are especially sensitive to progress in related
areas. In light of this, we welcome suggestions and corrections from instructors.
Acknowledgments

We have received a great deal of help from our colleagues and experts during the
preparation of this Manual. Dr. Chi-Yuan Chin, Myungho Lee, Weihua Mao, Fong Pong,
Dr. Viktor Prasanna, and Shisheng Shang have contributed solutions to a number of
the problems. Chien-Ming Cheng, Cho-Chin Lin, Myungho Lee, Jih-Cheng Lin, Weihua
Mao, Fong Pong, Stanley Wang, and Namhoon Yoo have generously shared their ideas
through stimulating discussions. We are indebted to Dr. Bill Nitzberg and Dr. David
Black for providing useful information and pointing to additional references. Finally,
our foremost thanks go to Professor Kai Hwang for many insightful suggestions and
judicious guidance.

H.-C. Wang
J.-G. Wu

Chapter 1
Parallel Computer Models
Problem 1.1

The effective CPI is

    CPI = (45 x 1 + 32 x 2 + 15 x 2 + 8 x 2) / 100 = 1.55 cycles/instruction.

The MIPS rate is

    (40 x 10^6 cycles/s) / (1.55 cycles/instruction) = 25.8 x 10^6 instructions/s = 25.8 MIPS.

The execution time is

    (45000 x 1 + 32000 x 2 + 15000 x 2 + 8000 x 2) cycles / (40 x 10^6 cycles/s) = 3.875 ms.

The execution time can also be obtained by dividing the total number of instructions
by the MIPS rate:

    Execution time = (45000 + 32000 + 15000 + 8000) instructions / (25.8 x 10^6 instructions/s)
                   = 3.875 ms.
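As a quick check, the same numbers follow from a few lines of Python; the instruction
mix and cycle counts below are those given in the problem.

    # Instruction mix from Problem 1.1: (count, cycles per instruction)
    mix = [(45000, 1), (32000, 2), (15000, 2), (8000, 2)]
    clock_hz = 40e6                                    # 40-MHz processor

    total_instructions = sum(count for count, _ in mix)
    total_cycles = sum(count * cpi for count, cpi in mix)

    cpi = total_cycles / total_instructions           # 1.55 cycles/instruction
    mips = clock_hz / cpi / 1e6                       # 25.8 MIPS
    exec_time = total_cycles / clock_hz               # 3.875e-3 s

    print(cpi, mips, exec_time)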
Problem 1.2  Instruction set and compiler technology affect the length of the ex-
ecutable code and the memory access frequency. CPU implementation and control
determine the clock rate. Memory hierarchy impacts the effective memory access time.
These factors together determine the effective CPI, as explained in Section 1.1.4.
Problem 1.3

(a) The effective CPI of the processor is calculated as

    CPI = (15 x 10^6 cycles/s) / (10 x 10^6 instructions/s) = 1.5 cycles/instruction.

(b) The effective CPI of the new processor is

    (1 + 0.3 x 2 + 0.05 x 4) = 1.8 cycles/instruction.

Therefore, the MIPS rate is

    (30 x 10^6 cycles/s) / (1.8 cycles/instruction) = 16.7 MIPS.
Problem 1.4

(a) Average CPI = 1 x 0.6 + 2 x 0.18 + 4 x 0.12 + 8 x 0.1 = 2.24 cycles/instruction.

(b) MIPS rate = 40/2.24 = 17.86 MIPS.
Problem 1.5

(a) False. The fundamental idea of multiprogramming is to overlap the computations
    of some programs with the I/O operations of other programs.

(b) True. In an SIMD machine, all processors execute the same instruction at the same
    time. Hence it is easy to implement synchronization in hardware. In an MIMD
    machine, different processors may execute different instructions at the same time,
    and it is difficult to support synchronization in hardware.

(c) True. Interprocessor communication is facilitated by sharing variables on a mul-
    tiprocessor and by passing messages among nodes of a multicomputer. The mul-
    ticomputer approach is usually more difficult to program, since the programmer
    must pay attention to the actual distribution of data among the processors.

(d) False. In general, an MIMD machine executes different instruction streams on
    different processors.

(e) True. Contention among processors to access the shared memory may create hot
    spots, making multiprocessors less scalable than multicomputers.
Problem 1.6  The MIPS rates for different machine-program combinations are shown
in the following table:

                |                  Machine
    Program     | Computer A | Computer B | Computer C
    ------------+------------+------------+-----------
    Program 1   |    100     |    10      |     5
    Program 2   |    0.1     |     1      |     5
    Program 3   |    0.2     |     0.1    |     2
    Program 4   |    1       |     0.125  |     1
Various means of these values can be used to compare the relative performance of
the computers. Definitions of the means for a sequence of positive numbers a_1, a_2, ..., a_n
are summarized below. (See also the discussion in Section 3.1.2.)

(a) Arithmetic mean: AM = (Σ_{i=1}^{n} a_i) / n.

(b) Geometric mean: GM = (Π_{i=1}^{n} a_i)^{1/n}.

(c) Harmonic mean: HM = n / (Σ_{i=1}^{n} 1/a_i).

In general,

    AM ≥ GM ≥ HM.                                                        (1.1)
Based on the definitions, the following table of mean MIPS rates is obtained:

                     | Computer A | Computer B | Computer C
    -----------------+------------+------------+-----------
    Arithmetic mean  |    25.3    |    2.81    |    3.25
    Geometric mean   |    1.19    |    0.59    |    2.66
    Harmonic mean    |    0.25    |    0.20    |    2.1
Note that the arithmetic mean of the MIPS rates is proportional to the inverse of the
harmonic mean of the execution times. Likewise, the harmonic mean of the MIPS
rates is proportional to the inverse of the arithmetic mean of the execution times. The two
observations are consistent with Eq. 1.1.

If we use the harmonic mean of MIPS rates as the performance criterion (i.e., each
program is executed the same number of times on each computer), computer C has
the best performance. On the other hand, if the arithmetic mean of MIPS rates is
used, which is equivalent to allotting an equal amount of time to the execution of each
program on each computer (i.e., fast-running programs are executed more frequently),
then computer A is the best choice.
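The three means can be recomputed directly; the MIPS values below are the entries of
the preceding table.

    # MIPS ratings from the table in Problem 1.6 (rows: programs 1-4)
    ratings = {
        "A": [100, 0.1, 0.2, 1],
        "B": [10, 1, 0.1, 0.125],
        "C": [5, 5, 2, 1],
    }

    def arithmetic_mean(a):
        return sum(a) / len(a)

    def geometric_mean(a):
        prod = 1.0
        for x in a:
            prod *= x
        return prod ** (1 / len(a))

    def harmonic_mean(a):
        return len(a) / sum(1 / x for x in a)

    for machine, mips in ratings.items():
        print(machine, arithmetic_mean(mips), geometric_mean(mips), harmonic_mean(mips))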
Problem 1.7

  • An SIMD computer has a single control unit. The other processors are simple
    slave processors which accept instructions from the control unit and perform an
    identical operation at the same time on different data. Each processor in an
    MIMD computer has its own control unit and execution unit. At any moment,
    a processor can execute an instruction different from those of the other processors.

  • Multiprocessors have a shared memory structure. The degree of resource sharing
    is high, and interprocessor communication is carried out via shared variables in
    the shared memory. In multicomputers, each node typically consists of a pro-
    cessor and local memory. The nodes are connected by communication channels
    which provide the mechanism for message interchange among processors. Re-
    source sharing among processors is low.

  • In the UMA architecture, each memory location in the system is equally accessible
    to all processors, and the access time is uniform. In the NUMA architecture, the
    access time to a memory location depends on the proximity of a processor to
    the memory location; therefore, the access time is nonuniform. In the NORMA
    architecture, each processor has its own private memory; no memory is shared
    among processors, and each processor is allowed to access its private memory only.
    In the COMA architecture, such as that adopted by the KSR-1, each processor has
    its own private cache, and these caches together constitute the global address space
    of the system. It is like a NUMA with cache in place of memory. A page of data
    can be migrated to a processor upon demand or be replicated on more than one
    processor.
Problem 1.8

(a) The total number of cycles needed on a sequential processor is (4 + 4 + 8 + 4 + 2 +
    4) x 64 = 1664 cycles.

(b) Each PE executes the same instruction on the corresponding elements of the vectors
    involved. There is no communication among the processors. Hence the total
    number of cycles on each PE is 4 + 4 + 8 + 4 + 2 + 4 = 26.

(c) The speedup is 1664/26 = 64 with a perfectly parallel execution of the code.
Problem 1.9

Because the processing power of a CRCW-PRAM and an EREW-PRAM is the
same, we need only focus on memory accesses. Below, we show that the time com-
plexity of simulating a concurrent write or a concurrent read on an EREW-PRAM is
O(log n). In the proof, we assume it is known that an EREW-PRAM can sort n
numbers or write a number to n memory locations in O(log n) time.

(a) We present the proof for simulating concurrent writes below.

    1. Create an auxiliary array A of length n. When CRCW processor P_i, for i =
       0, 1, ..., n-1, desires to write a datum x_i to a location l_i, each corresponding
       EREW processor P_i writes the ordered pair (l_i, x_i) to location A[i]. These
       writes are exclusive, since each processor writes to a distinct memory location.

    2. Sort the array by the first coordinate of the ordered pairs in O(log n) time,
       which causes all data written to the same location to be brought together in
       the output.

    3. Each EREW processor P_i, for i = 1, 2, ..., n-1, now inspects A[i] = (l_j, x_j)
       and A[i-1] = (l_k, x_k), where j and k are values in the range 0 ≤ j, k ≤ n-1.
       If l_j ≠ l_k, or i = 0, then processor P_i writes the datum x_j to location l_j
       in global memory. Otherwise, the processor does nothing. Since the array A is
       sorted by the first coordinate, only one of the processors writing to any given
       location actually succeeds, and thus the write is exclusive.

    This process thus implements each step of concurrent writing in the common-
    CRCW model in O(log n) time.

(b) We present the proof for simulating concurrent reads as follows:

    1. Create an auxiliary array A of length n. When CRCW processor P_i, for
       i = 0, 1, ..., n-1, desires to read a datum from a location l_i, each corresponding
       EREW processor P_i writes the ordered three-tuple (i, l_i, x_i) to location A[i],
       in which x_i is an arbitrary number. These writes are exclusive, since each
       processor writes to a distinct memory location.

    2. Sort the array A by the second coordinate of the three-tuples in O(log n) time,
       which causes all requests to read from the same location to be brought together
       in the output.

    3. Each EREW processor P_i, for i = 1, 2, ..., n-1, now inspects A[i] = (j, l_j, x_j)
       and A[i-1] = (k, l_k, x_k), where j and k are values in the range 0 ≤ j, k ≤
       n-1. If l_j ≠ l_k, or i = 0, then processor P_i reads the datum from location l_j
       in global memory. Otherwise, the processor does nothing. Since the array A is
       sorted by the second coordinate, only one of the processors reading from any
       given location actually succeeds, and thus the read is exclusive.

    4. Each EREW processor P_i that read a datum stores the datum in the third
       coordinate of A[i], and then broadcasts it to the third coordinates of the
       following entries A[j], j = i+1, i+2, ..., that request the same location. This
       takes O(log n) time.

    5. Sort the array A by the first coordinate of the ordered three-tuples in O(log n)
       time.

    6. Each EREW processor P_i reads the datum in the third coordinate of A[i].
       These reads are exclusive, since each processor reads from a distinct memory
       location.

    This process thus implements each step of concurrent reading in the common-
    CRCW model in O(log n) time.
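A sequential Python sketch of steps 1-3 of the concurrent-write simulation may help
make the idea concrete; the parallel O(log n) sort of the EREW-PRAM is simply
modeled here by an ordinary sort, and the winner-selection rule is the one described in
step 3.

    # Sequential model of simulating a common-CRCW concurrent write on an EREW-PRAM.
    # requests[i] = (l_i, x_i): processor i wants to write datum x_i to location l_i.
    def simulate_concurrent_write(requests, memory):
        # Step 1: each processor writes its (location, datum) pair to A[i] (exclusive writes).
        A = list(requests)
        # Step 2: sort by location; on an EREW-PRAM this would be an O(log n) parallel sort.
        A.sort(key=lambda pair: pair[0])
        # Step 3: only the processor holding the first pair for each location performs
        # the actual write, so every memory location is written by at most one processor.
        for i, (loc, datum) in enumerate(A):
            if i == 0 or A[i - 1][0] != loc:
                memory[loc] = datum
        return memory

    mem = {}
    simulate_concurrent_write([(3, 'a'), (3, 'b'), (7, 'c')], mem)
    print(mem)   # one winner for location 3, plus location 7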
Problem 1.10  For multiplying two n-bit binary integers, there are 2n bits of input
and 2n bits of output.

Suppose the circuit in question, in the grid model, is a rectangle of height h and
width w as shown in the following diagram:

Assume without loss of generality that h ≤ w, and that there is at most one wire along
each grid line. It is possible to divide the circuit by a line as shown in the figure. This
line runs between the grid lines and runs vertically, except possibly for a single jog of
one grid unit. Most importantly, we can select the line so that at least 2n/3 of the
output bits (i.e., 1/3 of the output bits) are emitted on each side. We select the line
by sliding it from left to right, until the first point at which at least 2n/3 of the output
bits are output to the left of the line.

If no more than 4n/3 of these bits are output to the left, we are done. If not, start
from the top, considering places to jog the line back one unit to the left. We know that
if the line jogs at the very top, fewer than 4n/3 of the bits are emitted to the left, and
if the line jogs at the very bottom, more than 2n/3 are. Thus, as no single grid point
can be the place where as many as n/3 of the bits are emitted, we can find a suitable
place in the middle to jog the line. There, we shall have between 2n/3 and 4n/3 of the
output bits on each side.

Now assume without loss of generality that at least half of the input bits are read
on the left of the line, and let us, by renumbering bits if necessary, assume that these
include x_k, x_2k, .... Suppose also that output bits y_i1, y_i2, ..., y_i(2n/3) are output on
the right. We can pick values of the inputs so that y_i1 = x_k, y_i2 = x_2k, and so on. Thus
information regarding the 2n/3 input bits x_k, x_2k, ..., x_2kn/3 must cross the line.

We may assume at most one wire or circuit element along any grid line, so the
number of bits crossing the line in one time unit is at most h + 1 (h horizontal and one
vertical, at the jog). It follows that (h + 1)T ≥ 2n/3, or else the required 2n/3 bits
cannot cross the line in time T. Since we assume w ≥ h, we have both hT = Ω(n) and
wT = Ω(n). Since wh = A, we have AT^2 = Ω(n^2) by taking the product. That is,
AT^2 ≥ kn^2 for some constant k.
Problem 1.11

(a) Since the processing elements of an SIMD machine read and write data from dif-
    ferent memory modules synchronously, no access conflicts should arise. Thus any
    PRAM variant can be used to model SIMD machines.

(b) The processors in an MIMD machine can read the same memory location simul-
    taneously. However, simultaneous writing to the same memory location is prohibited.
    Thus the CREW-PRAM can best model an MIMD machine.
Problem 1.12

(a) The memory organization changed from the UMA model (global shared memory) to
    the NUMA model (distributed shared memory).

(b) The medium-grain multicomputers use hypercubes as their interconnection net-
    works, while the fine-grain multicomputers use lower-dimensional k-ary n-cubes
    (e.g., 2-D or 3-D torus) as their interconnection networks.

(c) In the register-to-register architecture, vector registers are used to hold vector
    operands and intermediate and final vector results. In the memory-to-memory archi-
    tecture, vector operands and results are retrieved directly from the main memory
    by using a vector stream unit.

(d) In a single-threaded architecture, each processor maintains a single thread of control
    with limited hardware resources. In a multithreaded architecture, each processor
    can execute multiple contexts by switching among threads.
Problem 1.13
Assume the input is A(i) for 0 ≤ i < n.
Problem 2.11

(a) To design a direct network for a 64-node multicomputer, we can use

    • A 3-D torus with 4 nodes along each dimension. The relevant parameters are:
      d = 6, D = 3⌊k/2⌋ = 6, and l = 3N = 192. Also, d x D x l = 6912.

    • A 6-dimensional hypercube. The relevant parameters are: d = 6,
      D = n = 6, and l = n x N/2 = 6 x 64/2 = 192. We have d x D x l = 6912.

    • A CCC with dimension k = 4. The relevant parameters are: d = 3, D =
      2k - 1 + ⌊k/2⌋ = 2 x 4 - 1 + ⌊4/2⌋ = 9, and l = 3N/2 = 96. The value of
      d x D x l is 2592.
(b)
    • If the quality of a network is measured by (d x D x l)^(-1), then a CCC is better
      than a 3-D torus or a 6-cube. A 3-D torus and a 6-cube have the same quality.

    • The torus and hypercube have similar network properties and are treated
      together. We have Σ = (1 + 6) x 6 / 2 = 21. Denote by P(i) the probability
      of information exchange between nodes at a distance i, and assume this
      probability decreases linearly with distance. Then we have

          P(1) = 6/21, P(2) = 5/21, P(3) = 4/21, P(4) = 3/21, P(5) = 2/21, P(6) = 1/21.

      Therefore, the mean internode distance is

          1 x 6/21 + 2 x 5/21 + 3 x 4/21 + 4 x 3/21 + 5 x 2/21 + 6 x 1/21 = 56/21 ≈ 2.67.

    • For the CCC, we have Σ = (1 + 9) x 9 / 2 = 45. The probabilities of inter-
      node communication for distance i are

          P(1) = 9/45, P(2) = 8/45, ..., P(9) = 1/45.

      Hence, the mean internode distance is

          1 x 9/45 + 2 x 8/45 + ... + 9 x 1/45 = 165/45 ≈ 3.67.

    In conclusion, the mean internode distance of the 4-CCC is greater than that of the
    6-cube and the 3-D torus. The 6-cube and the 3-D torus have identical mean internode
    distances. The similarity of the 6-cube and 3-D torus in the above is more than incidental.
    In fact, it has been shown [Wang88] that when k = 4 (as is the case for this problem),
    a k-ary n-cube is exactly a 2n-dimensional binary hypercube.
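The network parameters and the mean internode distances can be tabulated with a
short script; the linearly decreasing distance distribution assumed above is encoded in
mean_distance.

    # Problem 2.11(a): degree d, diameter D, link count l for three 64-node networks.
    N = 64
    networks = {
        "3-D torus (k=4)": dict(d=6, D=3 * (4 // 2), l=3 * N),
        "6-cube":          dict(d=6, D=6,            l=6 * N // 2),
        "4-CCC":           dict(d=3, D=2 * 4 - 1 + 4 // 2, l=3 * N // 2),
    }
    for name, p in networks.items():
        print(f"{name}: d={p['d']}, D={p['D']}, l={p['l']}, dxDxl={p['d'] * p['D'] * p['l']}")

    # Mean internode distance, assuming the probability of communication at
    # distance i decreases linearly with i: P(i) proportional to (D + 1 - i).
    def mean_distance(D):
        weights = [D + 1 - i for i in range(1, D + 1)]
        total = sum(weights)
        return sum(i * w for i, w in zip(range(1, D + 1), weights)) / total

    print(mean_distance(6))   # 6-cube / 3-D torus: 56/21, about 2.67
    print(mean_distance(9))   # 4-CCC: 165/45, about 3.67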
Problem 2.12

(a) It should be noted that we are looking for nodes that can be reached from N0 in
    exactly 3 steps. Therefore, nodes that can be reached in 1 or 2 steps have to be
    excluded.

    • For an 8 x 8 Illiac mesh, the reachable nodes can be calculated by the expression
      (a + b + c) mod 64, where a, b, and c can be +1, -1, +8, or -8. There are 20
      combinations (4 if a, b, and c are all different; 12 if two of them are equal; 4 if
      a = b = c). However, 8 of the combinations contain the pair +1 and -1 or the
      pair +8 and -8, making them reachable in one step. Such nodes have to be
      eliminated from the list. Hence, 12 nodes can be reached from N0 in three
      steps. The addresses of these nodes are 3, 6, 10, 15, 17, 24, 40, 47, 49, 54,
      58, and 61.

    • For a binary 6-cube, the binary address a5...a1a0 of a node reachable in three
      steps from N0 has exactly three 1s. There are 20 possible combinations
      (C(6,3)). The addresses of these nodes are 7, 11, 13, 14, 19, 21, 22, 25, 26,
      28, 35, 37, 38, 41, 42, 44, 49, 50, 52, and 56.

    • For the 64-node barrel shifter, the nodes reachable in exactly three steps can be
      determined as follows. List all 6-bit numbers which contain three 1s. There are
      20 such numbers. First take the 1's complement of each number and then add 1
      to each of the resulting numbers. (Equivalently, the new numbers are obtained
      by subtracting each of the original numbers from 64.) If a new number has three
      or four 1s in its binary representation and the 1s are separated by at least one 0,
      then both nodes whose addresses are the original number and the new number
      can be reached in exactly three steps. (The last point of the rule is due to
      the fact that clustered 1s can always be replaced by two 1s.) The addresses
      of these nodes are 11, 13, 19, 21, 22, 23, 25, 26, 27, 29, 35, 37, 38, 39, 41,
      42, 43, 45, 51, and 53.

(b) The upper bound on the minimum number of routing steps needed to send data
    from any node to another is 7 (= √64 - 1) for an 8 x 8 Illiac mesh, 6 for a 6-cube,
    and 3 (= log2(64)/2) for a 64-node barrel shifter.

(c) The upper bound on the minimum number of routing steps needed to send data
    from any node to another is 31 for a 32 x 32 Illiac mesh, 10 for a 10-cube, and 5
    for a 1024-node barrel shifter.
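The node lists in part (a) can be verified by a small breadth-first search; the move sets
below encode the Illiac mesh (±1, ±8 mod 64) and the 6-cube (single-bit complement).

    # Problem 2.12(a): nodes reachable from N0 in exactly three steps.
    def reachable_in_exactly(steps, moves, start=0, size=64):
        """Nodes whose minimum distance from start is exactly `steps`."""
        frontier, seen = {start}, {start}
        for _ in range(steps):
            frontier = {(node + m) % size for node in frontier for m in moves} - seen
            seen |= frontier
        return sorted(frontier)

    # 8 x 8 Illiac mesh: each node connects to nodes +/-1 and +/-8 (mod 64).
    print(reachable_in_exactly(3, [+1, -1, +8, -8]))

    # Binary 6-cube: neighbours differ in exactly one of the six address bits.
    def cube_reachable(steps, n_bits=6, start=0):
        frontier, seen = {start}, {start}
        for _ in range(steps):
            frontier = {node ^ (1 << b) for node in frontier for b in range(n_bits)} - seen
            seen |= frontier
        return sorted(frontier)

    print(cube_reachable(3))   # all addresses with exactly three 1-bits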
Problem 2.13  Part of Table 2.4 in the text is duplicated below:

    Network          | Bus               | Multistage          | Crossbar
    Characteristics  | System            | Network             | Switch
    -----------------+-------------------+---------------------+--------------------
    Minimum latency  | Constant          | O(log_k n)          | Constant
    for unit data    |                   |                     |
    transfer         |                   |                     |
    Bandwidth        | O(w/n) to O(w)    | O(w) to O(nw)       | O(w) to O(nw)
    per processor    |                   |                     |
    Wiring           | O(w)              | O(nw log_k n)       | O(n^2 w)
    complexity       |                   |                     |
    Switching        | O(n)              | O(n log_k n)        | O(n^2)
    complexity       |                   |                     |
    Connectivity     | Only one to one   | Some permutations   | All permutations,
    and routing      | at a time.        | and broadcast, if   | one at a time.
    capability       |                   | network unblocked.  |
    Remarks          | Assume n proces-  | n x n MIN           | Assume n x n
                     | sors on the bus;  | using k x k         | crossbar with
                     | bus width is w    | switches with line  | line width of
                     | bits.             | width of w bits.    | w bits.
Problem 2.14

(a) For each output terminal, there are 4 possible connections (one from each of the
    input terminals), so there are 4 x 4 x 4 x 4 = 256 legitimate states.

(b) 48 (= 16 x 3) 4 x 4 switch modules are needed to construct a 64-input Omega
    network. There are 24 (= 4 x 3 x 2 x 1) permutation connections in a 4 x 4 switch
    module. Therefore a total of 24^48 permutations can be implemented in a single
    pass through the network without blocking.

(c) The total number of permutations of 64 inputs is 64!. So the fraction is 24^48/64!
    ≈ 1.4 x 10^-23.
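A few lines of Python reproduce the counts in (b) and (c).

    # Problem 2.14: fraction of the 64! permutations realizable in one pass
    # through a 64-input Omega network built from 4 x 4 switch modules.
    from math import factorial, log2

    n = 64
    switch_size = 4
    stages = int(log2(n) / log2(switch_size))         # 3 stages
    modules = stages * (n // switch_size)             # 48 modules
    perms_per_module = factorial(switch_size)         # 24 one-to-one settings

    single_pass = perms_per_module ** modules         # 24**48
    fraction = single_pass / factorial(n)
    print(modules, single_pass, f"{fraction:.2e}")    # 48, 24**48, about 1.4e-23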
Problem 2.15

(a) We label the switch modules of a 16 x 16 Baseline network as shown in the figure
    below. Then, by changing the positions of some of the switch modules, the Baseline
    network becomes an Omega network.

    [Figures: the labeled 16 x 16 Baseline network and the equivalent Omega network
    obtained after repositioning the switch modules.]

(b) If we change the positions of some switch modules in the Baseline network in a
    different way, it becomes the Flip network.

    [Figure: the Baseline network redrawn as a Flip network.]

(c) Since both the Omega network and the Flip network are topologically equivalent
    to the Baseline network, they are topologically equivalent to each other.
Problem 2.16

(a) k^n.

(b) n⌊k/2⌋.

(c) 2k^(n-1).

(d) 2n.

(e)
    • A k-ary 1-cube is a ring with k nodes.
    • A k-ary 2-cube is a 2-D k x k torus.
    • A mesh is a torus without end-around connections.
    • A 2-ary n-cube is a binary n-cube.
    • An Omega network is the multistage network implementation of a shuffle-
      exchange network. Its switch modules can be repositioned to have the same
      interconnection topology as a binary n-cube.

(f) The conventional torus has long end-around connections, but the folded torus has
    equal-length connections. (See Figure 2.21 in the text.)

(g)
    • The relation

          B = 2wN/k

      will be shown in the solution of Problem 2.18. Therefore, if both the number
      of nodes N and the wire bisection width B are constants, the channel width w
      will be proportional to k:

          w = B/(2N/k) = Bk/(2N).

    • The latency of a wormhole-routed network is inversely proportional to the
      channel width w, hence also inversely proportional to k. This means a network
      with a higher k will have lower latency. For two k-ary n-cube networks with
      the same number of nodes, the one with a lower dimension has a larger k, and
      hence a lower latency.

    • It will be shown in the solution of Problem 2.18 that the hot-spot throughput
      is equal to the bandwidth of a single channel:

          Θ_HS = w = kB/(2N).

      Low-dimensional networks have a larger k, hence a higher hot-spot through-
      put.
Problem 2.17

(a) In a tree network, a message going from processor i to processor j goes up the
    tree to their least common ancestor and then back down according to the least
    significant bits of j. Message traffic through lower-level (closer to the root) nodes
    is heavier than that of higher-level nodes. The lower-level channels in a fat tree
    have a greater number of wires, and hence a higher bandwidth. This will prevent
    congestion in the lower-level channels.

(b) The capacity of a universal fat tree at level k is

        c_k = min(⌈n/2^k⌉, ⌈w/2^(2k/3)⌉).

    • If k ≥ 3 log(n/w), then ⌈n/2^k⌉ ≤ ⌈w/2^(2k/3)⌉. Therefore, c_k = ⌈n/2^k⌉,
      which is 1, 2, 4, ..., for k = log(n + 1), log(n + 1) - 1, log(n + 1) - 2, ....

    • If k < 3 log(n/w), then ⌈n/2^k⌉ > ⌈w/2^(2k/3)⌉. Hence c_k = ⌈w/2^(2k/3)⌉,
      which is w, w/2^(2/3), w/2^(4/3), w/4, ..., for k = 0, 1, 2, 3, ....

    • Initially, the capacities double from one level to the next toward the root,
      but at levels less than 3 log(n/w) away from the root, the channel capacities
      grow at the slower rate of 2^(2/3) per level.
Problem 2.18

(a) A k-ary n-cube network has N nodes, where N = k^n. Assume k is even. If the
    network is partitioned along one dimension into two parts of equal size, the "cross
    section" separating the two parts is of size N/k. Corresponding to each node in
    the cross section, there are two wires, one being the nearest-neighbor link and the
    other the wraparound link in the original network. Therefore, the cross section
    contains 2N/k wires, each w bits wide, giving a wire bisection width B = 2wN/k.
    The argument also holds for k odd, although the partitioning is slightly more
    complex.

(b) The hot-spot throughput of a network is the maximum rate at which messages
    can be sent from one specific node P_i to another specific node P_j. For a k-ary
    n-cube with deterministic routing, the hot-spot throughput, Θ_HS, is equal to the
    bandwidth of a single channel w. From (a), w = kB/(2N). Therefore,

        Θ_HS = kB/(2N),

    which is proportional to k for a fixed B.
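A small numerical illustration (with made-up example sizes, not taken from the
problem): holding the wire bisection B fixed, the lower-dimensional network has the
wider channel and hence the higher hot-spot throughput.

    # Problem 2.18: wire bisection width and hot-spot throughput of a k-ary n-cube.
    def bisection_width(N, k, w):
        """B = 2wN/k: N/k nodes in the cross section, two w-bit links per node."""
        return 2 * w * N // k

    def hot_spot_throughput(N, k, B):
        """Theta_HS equals the width of a single channel, w = kB/(2N)."""
        return k * B / (2 * N)

    # Illustration: 256 nodes arranged as a 16-ary 2-cube and as a 4-ary 4-cube.
    print(bisection_width(N=256, k=16, w=16))          # 512 wires with 16-bit channels
    for k, n in [(16, 2), (4, 4)]:
        N = k ** n
        # Fix the wire bisection at B = 512; larger k gives wider channels.
        print(k, n, hot_spot_throughput(N, k, B=512))  # 16.0 and 4.0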
Problem 2.19

(a) Embedding of an r x r torus in a hypercube is shown in the following diagrams
    for r = 2 and 4, respectively ((a) and (c)). As can be seen, if the nodes of a torus
    are numbered properly, we obtain internode connections identical to those of a
    hypercube (nodes whose numbers differ by a power of 2 are linked directly).

    A 2r x 2r torus can be constructed from r x r tori in two steps. In step one,
    a 2r x r torus is built by combining an r x r torus with its "mirror" image (in the
    sense of node numbering) and connecting the corresponding nodes, as shown in
    diagram (b). In step two, the 2r x r torus is combined with its mirror image to form
    a 2r x 2r torus. In this manner, a torus can be fully embedded in a hypercube of
    dimension d with 2^d = r^2 nodes.

    In general, it has been shown that any m1 x m2 x ... x mj torus, where m_i = 2^p_i,
    can be embedded in a hypercube of dimension d = p1 + p2 + ... + pj with the
    proximity property preserved, using a binary reflected Gray code for the mapping
    [Chan86].
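A short sketch of the binary reflected Gray code mapping mentioned above; it checks
the proximity property for a 4 x 4 torus embedded in a 4-cube. The helper names are
illustrative, not taken from [Chan86].

    # Problem 2.19(a): embedding an r x r torus in a hypercube using the binary
    # reflected Gray code, so that torus neighbours map to hypercube neighbours.
    def gray(i):
        return i ^ (i >> 1)

    def torus_to_hypercube(r):
        """Map torus node (x, y), 0 <= x, y < r, to a hypercube address (r a power of two)."""
        bits = r.bit_length() - 1
        return {(x, y): (gray(x) << bits) | gray(y) for x in range(r) for y in range(r)}

    # Check the proximity property for r = 4: wrap-around neighbours differ in one bit.
    r = 4
    m = torus_to_hypercube(r)
    for (x, y), addr in m.items():
        for nx, ny in [((x + 1) % r, y), (x, (y + 1) % r)]:
            assert bin(addr ^ m[(nx, ny)]).count("1") == 1
    print("4 x 4 torus embeds in a 4-cube with dilation 1")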
(b) Embedding of a ring on a CCC is equivalent to finding a Hamiltonian cycle on
    the CCC. In the following figure, the embedding of rings on CCCs for k = 3 and
    4, respectively, is shown. It is easy to first consider the embedding of a ring on
    a binary hypercube by treating the cycle at each vertex of the hypercube as a
    supernode. This step can be carried out easily, and there are several possible ways
    to embed a ring on a hypercube. Then, whenever a supernode on the embedded
    ring is visited, all the nodes in the corresponding cycle are linked.

(c) Embedding of a complete balanced tree in a mesh is shown in the following diagram
    for trees of different heights. In general, the root of a tree is mapped to the center
    node of a mesh, and leaf nodes are mapped to outlying mesh nodes. The process
    is recursive. Suppose a tree of height l ≥ 3 has been embedded in an r x r mesh.
    When embedding a tree of height l + 1, an (r + 2) x (r + 2) mesh is needed, with
    the new leaf nodes mapped to the boundary nodes of the augmented mesh. This
    is illustrated for l = 3.
Problem 2.20

(a) A hypernet combines the hierarchical structure of a tree in the overall architecture
    and the uniform connectivity of a hypercube in its building blocks.

(b) By construction, a hypernet built with identical modules (buslets, treelets, cubelets,
    etc.) has a constant node degree. This is achieved by a systematic use of the ex-
    ternal links of each cubelet when building larger and larger systems.

(c) The hypernet architecture was proposed to take advantage of localized communi-
    cation patterns present in some applications such as connectionist neural networks.
    The connection structure of hypernets gives effective support for communication
    between adjacent lower-level clusters. Global communication is also supported,
    but the bandwidth provided is lower. Algorithms with commensurate nonuniform
    communication requirements among different components are suitable candidates
    for implementation on hypernets.

    I/O capability of a hypernet is furnished by the external links of each building
    block. As a result, I/O devices can be spread throughout the hierarchy to meet I/O
    demand. Fault tolerance is built into the hypernet architecture to allow graceful
    degradation. Execution of a program can be switched to a subnet in case of node
    or link failures. The modular construction also facilitates isolation and subsequent
    replacement of faulty nodes or subnets.

Chapter 3
Principles of Scalable Performance
Problem 3.1

(a) CPI = 1 x 0.6 + 2 x 0.18 + 4 x 0.12 + 12 x 0.1 = 2.64 cycles/instruction.

(b) MIPS rate = (4 x 40 x 10^6 cycles/s) / (2.64 cycles/instruction) = 60.60 MIPS.

(c) When a single processor is used, the execution time is t1 = 200000/17.86 = 1.12 x 10^4
    μs. When four processors are used, the time is reduced to t4 = 220000/60.60 =
    3.63 x 10^3 μs. Hence the speedup is 11.2/3.63 = 3.08 and the efficiency is 3.08/4 =
    0.77.
Problem 3.2

(a) If the vector mode is not used at all, the execution time will be

        0.75T + 9 x 0.25T = 3T.

    Therefore, the effective speedup is 3T/T = 3. Let the fraction of vectorized code
    be α. Then α = 9 x 0.25T / 3T = 0.75.

(b) Suppose the speed ratio between the vector mode and the scalar mode is doubled.
    The execution time becomes

        0.75T + 0.25T/2 = 0.875T.

    The effective speedup is 3T/0.875T = 24/7 = 3.43.

(c) Suppose the speed for vector mode computation is still nine times as fast as that
    for scalar mode. To maintain the effective speedup of 3.43, the vectorization ratio
    α must satisfy the following relation:

        3T / ((1 - α) x 3T + α x 3T/9) = 24/7.

    Solving the equation, we obtain α = 51/64 ≈ 0.8.
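The three answers follow from the usual Amdahl-style speedup expression; a few lines
of Python reproduce them.

    # Problem 3.2: speedup from partial vectorization.
    def speedup(alpha, ratio):
        """alpha: vectorized fraction of the work; ratio: vector/scalar speed ratio."""
        return 1 / ((1 - alpha) + alpha / ratio)

    print(speedup(0.75, 9))        # (a) 3.0
    print(speedup(0.75, 18))       # (b) 24/7, about 3.43

    # (c) solve for alpha giving speedup 24/7 when the ratio stays at 9:
    target = 24 / 7
    alpha = (1 - 1 / target) / (1 - 1 / 9)     # 51/64, about 0.8
    print(alpha)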
Problem 3.3

(a) Suppose the total workload is W million instructions. Then the execution time in
    seconds is

        T = αW/(nR) + (1 - α)W/R,

    where R is the MIPS rate of a single processor. Therefore, the effective MIPS rate
    is

        W/T = nR / (α + n(1 - α)) = nR / (n - (n - 1)α).

(b) Substituting the given data (n = 16) into the expression in (a), we obtain an
    equation whose denominator is 16 - 15α; solving it gives α = 24/25 = 0.96.
Problem 3.4  Assume the speed in enhanced mode is n times as fast as that in regular
mode. The execution time, as a function of the fraction a of the workload executed in
regular mode, is

    T(a) = a/R + (1 - a)/(nR),

where R is the execution rate in regular mode.

(a) If a varies linearly between a and b, the average execution time is

        T_avg = ∫ T(a) da / (b - a) = [(n - 1)(b + a) + 2] / (2nR).

    The average execution rate is

        R_avg = 1/T_avg = 2nR / [(n - 1)(b + a) + 2],

    and the average speedup factor is

        S_avg = R_avg / R = 2n / [(n - 1)(b + a) + 2].

(b) If a → 0 and b → 1, then

        S_avg = 2n / (n + 1).
Problem 3.5

(a) The harmonic mean execution rate in MIPS is

        R_h = 1 / (Σ_{i=1}^{4} f_i / R_i).

    The arithmetic mean execution time is

        T_a = Σ_{i=1}^{4} f_i / R_i = 1 / R_h.

(b) Given f1 = 0.4, f2 = 0.3, f3 = 0.2, f4 = 0.1, and R1 = 4 MIPS, R2 = 8 MIPS,
    R3 = 11 MIPS, R4 = 15 MIPS, the arithmetic mean execution time is T =
    0.4/4 + 0.3/8 + 0.2/11 + 0.1/15 = 0.162 μs per instruction.

    Several factors cause R_i to be smaller than 5i. First, there might be memory
    access operations which take extra machine cycles. Second, when the number of
    processors is increased, more memory access conflicts arise, which increase the ex-
    ecution time and lower the effective MIPS rate. Third, part of the program may
    have to be executed sequentially or can be executed by only a limited number of
    processors simultaneously. Finally, there is an overhead for processors to synchro-
    nize with each other. Because of these overheads, R_i/i typically decreases with i.

(c) Given a new distribution f1 = 0.1, f2 = 0.2, f3 = 0.3, and f4 = 0.4 due to the
    use of an intelligent compiler, the arithmetic mean execution time becomes T =
    0.1/4 + 0.2/8 + 0.3/11 + 0.4/15 = 0.104 μs per instruction.
Problem 3.6  Amdahl's law is based on a fixed workload, where the problem size is
fixed regardless of the machine size. Gustafson's law is based on a scaled workload,
where the problem size is increased with the machine size so that the solution time is
the same for sequential and parallel executions. Sun and Ni's law also applies to
scaled problems, where the problem size is increased to match the maximum memory
capacity.
Problem 3.7

(a) The total number of clock cycles needed is

        Σ_{I=1}^{1024} (2 + 2I) = 2 x 1024 + 1024 x 1025 = 1,051,648.

(b) If consecutive outer-loop iterations are assigned to a single processor, the workload
    is not balanced and the parallel execution time is dominated by that on processor
    32. The number of clock cycles needed is

        Σ_{I=993}^{1024} (2 + 2I) = 2 x 32 + (993 + 1024) x 32 = 64,608.

    The speedup is

        1051648 / 64608 = 16.28.

(c) To balance the load, we divide the outer loop into 64 chunks, each consisting of 16
    iterations. Each processor is allocated a pair of chunks in a fold-over manner. That
    is, processor 1 is allocated the first and the last chunks, processor 2 the second and
    the second-to-last chunks, and so on. Thus, we have the following modified code:

        Doall L = 1, 32
           Do 10 I = (L-1) * 16 + 1, L * 16
              SUM(I) = 0
              Do 20 J = 1, I
        20    SUM(I) = SUM(I) + J
        10 Continue
           Do 30 I = (64-L) * 16 + 1, (64-L+1) * 16
              SUM(I) = 0
              Do 40 J = 1, I
        40    SUM(I) = SUM(I) + J
        30 Continue
        Endall

(d) Suppose the overhead associated with flow control is neglected. The number of
    cycles required for the computation on processor L, 1 ≤ L ≤ 32, is

        Q_L = Σ_{I=(L-1)x16+1}^{Lx16} (2 + 2I) + Σ_{I=(64-L)x16+1}^{(64-L+1)x16} (2 + 2I)
            = {[(L-1) x 16 + 2 + L x 16 + 1] + [(64-L) x 16 + 2 + (64-L+1) x 16 + 1]} x 16
            = 2054 x 16 = 32,864.

    The speedup in this case is

        1051648 / 32864 = 32.
Problem 3.8

(a) An example program is shown below. Assume α, β, γ are the base addresses of
    A, B, C, respectively, which point to the first element of the individual arrays.
    Also assume only a small number of registers are available. The notation M(addr)
    stands for the value stored in memory location addr.

           Mov   R1, 0           ; Initialize R1 = index i
           Mov   R5, 0           ; Initialize R5 = i x n
           Mov   R7, 0           ; Initialize R7 = offset of C_ij
    Loop1: Mov   R2, 0           ; Reset R2 = index j
    Loop2: Mov   R3, 0           ; Reset R3 = index k
           Mov   R4, 0           ; Reset R4 = k x n
           Mov   R6, R5          ; R6 = i x n
           Mov   R11, 0          ; R11 = value of C_ij
    Loop3: Add   R4, R2          ; Compute offset for B_kj
           Load  R9, M(R4 + β)   ; Fetch B_kj
           Load  R10, M(R6 + α)  ; Fetch A_ik
           Mul   R10, R9         ; A_ik x B_kj
           Add   R11, R10        ; Update C_ij
           Inc   R3              ; Increment k
           Inc   R6              ; Increment offset for A_ik
           Add   R4, n           ; Compute k x n
           Cmp   R3, n           ; Check limit for k
           Jnz   Loop3           ; Loop until limit is reached
           Store M(R7 + γ), R11  ; Store value of C_ij
           Inc   R2              ; Increment j
           Inc   R7              ; Increment offset for C_ij
           Cmp   R2, n           ; Check limit for j
           Jnz   Loop2           ; Loop until limit is reached
           Inc   R1              ; Increment i
           Add   R5, n           ; Compute i x n
           Cmp   R1, n           ; Check limit for i
           Jnz   Loop1           ; Loop until limit is reached

(b) From the above code, the number of instructions is

        I = 10n^3 + 9n^2 + 5n + 3.

    For the timing analysis, the following numbers of cycles are assumed for the
    different types of instructions:

        • Add, Mul, Cmp, Jnz: 2 cycles.
        • Load, Store: 4 cycles.
        • Mov, Inc: 1 cycle.

    Based on the above assumptions, we obtain the following serial execution time:

        T1 = (22n^3 + 14n^2 + 8n + 3) cycles.

    The average number of cycles per instruction can be calculated as

        CPI = T1 / I = (22n^3 + 14n^2 + 8n + 3) / (10n^3 + 9n^2 + 5n + 3) cycles/instruction.

    Asymptotically, the CPI is close to 2.2 when n is large.

(c) If the clock rate is 40 MHz, a rough estimate of the MIPS rate is

        40 / 2.2 ≈ 18 MIPS.
(d) Matrix A is partitioned into blocks by row and matrix B by column, as shown in
    the following diagram:

    [Figure: A partitioned into N row blocks A_1, ..., A_N, B into N column blocks
    B_1, ..., B_N, and the product C into N x N subblocks C_ij.]

    Each block A_i represents the rows (i-1)n/N + 1 through in/N of A.
    Similar notations are used for the block submatrices of B. The multiplication of one
    A block of size (n/N) x n and a B block of size n x (n/N) yields one subblock of
    size (n/N) x (n/N) in the product matrix C.

    The amount of time required for the computations in each processor is 22(n/N)^2 n +
    14(n/N)n + 8n/N + 3 cycles, provided each processor is identical to the uniprocessor
    used in (a) and memory access conflicts are ignored. Each processor needs to compute
    N such subblocks. Thus, the total parallel execution time is (22n^3/N + 14n^2 + 8n + 3N)
    cycles. The potential speedup is (22n^3 + 14n^2 + 8n + 3) / (22n^3/N + 14n^2 + 8n + 3N)
    ≈ N when n is large.

(e) The matrix is partitioned as in part (d). Initially, node i has submatrices A_i and
    B_i for i = 1, ..., N. In the first step, each node computes a subblock C_ii of matrix
    C. After that, the nodes exchange subblocks of B in the following manner: node 1
    sends its B block to node 2, node 2 sends its B block to node 3, ..., node N sends
    its B block to node 1. Then each node i computes the subblock C_{i,i-1}, except
    node 1, which computes C_{1,N}. The process is repeated until all the subblocks of
    C are computed in N steps. If the initial distribution of B to the nodes is not
    counted, the number of message-passing steps is N - 1.

    The sequence of computations is illustrated in the following diagram for N = 4,
    with different shades indicating subblocks computed in different steps. Each block
    corresponds to a C subblock of size (n/4) x (n/4).
(f) Assume each message consists of a single element of matrix B, which is 8 bytes for
    double-precision floating-point numbers. The message-sending operation for node
    i in step j (1 ≤ j ≤ N-1) can be specified as follows:

        /* Process i sending messages in step j                        */
        /* jj is the index of the B block currently held by node i    */
        jj = i - j + 1
        if (jj < 1) jj = jj + N
        for k = (jj - 1) * n/N + 1 to jj * n/N do
            for l = 1 to n do
                if (i == N) send(1, B(l, k), 8);
                else send(i + 1, B(l, k), 8);
            enddo
        enddo

    The first parameter of the send instruction is the destination node ID, the
    second is the element of B to be transmitted, and the third is the length of the
    message. There should also be code in the node receiving the message; it is similar
    to the sending counterpart. For simplicity this code is not shown. By symmetry,
    each node must execute both sending and receiving instructions.
    The execution time can be divided into the time for arithmetic operations
    (t_a) and that for communication (t_c), assuming there is no overlap between the
    two types of operations. The time for arithmetic operations is identical to that on
    a shared-memory multiprocessor, which is t_a = 22n^3/N + 14n^2 + 8n + 3N. The
    total number of message-passing operations is (N - 1) x (n/N) x n. Thus, the total
    execution time is

        (22n^3/N + 14n^2 + 8n + 3N + 100(N - 1) x (n/N) x n) cycles.

    Therefore, the speedup is

        (22n^3 + 14n^2 + 8n + 3) / (22n^3/N + 14n^2 + 8n + 3N + 100(N - 1) x (n/N) x n).

    Note that different assumptions for this problem will lead to different speedup
    results. It is also possible to use other matrix multiplication algorithms such as
    those described in the text.
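Under the cycle-count models derived above (and treating the 100-cycle message cost
as given), the speedups can be tabulated; the particular values of n and N below are
illustrative only.

    # Problem 3.8: serial and parallel cycle counts for n x n matrix multiplication.
    def t_serial(n):
        return 22 * n**3 + 14 * n**2 + 8 * n + 3

    def t_shared_memory(n, N):
        # N processors, each computing N subblocks of size (n/N) x (n/N)
        return 22 * n**3 / N + 14 * n**2 + 8 * n + 3 * N

    def t_message_passing(n, N, msg_cycles=100):
        # add (N - 1) * (n/N) * n message-passing operations of ~100 cycles each
        return t_shared_memory(n, N) + msg_cycles * (N - 1) * (n / N) * n

    N = 16
    for n in (64, 256, 1024):
        print(n, t_serial(n) / t_shared_memory(n, N), t_serial(n) / t_message_passing(n, N))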
Problem 3.9

(a) The arithmetic mean execution time of each machine is calculated as follows:

    • Machine A: (1 + 1000 + 500 + 100)/4 = 400.25 s.
    • Machine B: (10 + 100 + 1000 + 800)/4 = 477.5 s.
    • Machine C: (20 + 20 + 50 + 100)/4 = 47.5 s.

(b) Harmonic mean MIPS rates:

    • Machine A: 100/400.25 = 0.25 MIPS.
    • Machine B: 100/477.5 = 0.21 MIPS.
    • Machine C: 100/47.5 = 2.1 MIPS.

(c) In terms of harmonic mean execution rate, Machine C is higher than Machine A,
    which is in turn higher than Machine B. See also the discussion in Problem 1.6.
Problem 3.10

(a) The total execution time in serial execution is

        T(1) = Σ_i W_i.

    In parallel execution with n processors, the execution time is

        T(n) = Σ_i W_i / i.

    Therefore, the speedup for the fixed-memory model is S*_n = T*(1)/T*(n), where the
    asterisk denotes the scaled workload. If (i) W_i = 0 for i ≠ 1 and i ≠ n, (ii) W*_1 =
    W_1, and (iii) W*_n = G(n) W_n, then

        S*_n = (W_1 + G(n) W_n) / (W_1 + G(n) W_n / n).                  (3.1)

(b) When G(n) = 1, i.e., the problem size remains fixed when the memory size is
    increased, Amdahl's law is obtained:

        S_n = (W_1 + W_n) / (W_1 + W_n / n).

(c) When G(n) = n, i.e., the problem size increases in direct proportion to the memory-
    size increase (which in turn is proportional to the number of processors), Gustafson's
    law is obtained:

        S'_n = (W_1 + n W_n) / (W_1 + W_n).                              (3.2)

(d) Let W_1 = a, W_n = 1 - a. The relation S_n ≤ S'_n follows from the definitions:

        S_n = (a + (1 - a)) / (a + (1 - a)/n) ≤ a + n(1 - a) = S'_n.

    We now show S'_n ≤ S*_n. Assume G(n) = g ≥ n.
    Let β = 1 - a. Eqs. 3.1 and 3.2 can be rewritten as

        S'_n = (a + nβ) / (a + β)                                        (3.3)

    and

        S*_n = (a + gβ) / (a + gβ/n).                                    (3.4)

    Consider three different cases:

    1. a = 1, β = 0: S'_n = S*_n = 1.

    2. a = 0, β = 1: S'_n = S*_n = n.

    3. 0 < a < 1 and g ≥ n: comparing Eqs. 3.3 and 3.4 shows that S*_n ≥ S'_n in this
       case as well,

    which completes the proof.
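The ordering S_n ≤ S'_n ≤ S*_n can also be observed numerically; the values a = 0.2,
n = 64, and G(n) = n^1.5 below are arbitrary illustrative choices.

    # Problem 3.10: the three speedup models, with W_1 = a and W_n = 1 - a.
    def fixed_load(a, n):            # Amdahl's law, G(n) = 1
        return 1 / (a + (1 - a) / n)

    def scaled_time(a, n):           # Gustafson's law, G(n) = n
        return a + n * (1 - a)

    def scaled_memory(a, n, G):      # Sun and Ni's memory-bounded model
        return (a + G * (1 - a)) / (a + G * (1 - a) / n)

    a, n = 0.2, 64
    print(fixed_load(a, n), scaled_time(a, n), scaled_memory(a, n, G=n**1.5))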
Problem 3.11  In a reasonable execution environment, the workload and execution
times should satisfy the following conditions:

1. At most n instructions can be executed by n processors simultaneously:

       T(n) ≤ O(n) ≤ nT(n).                                              (3.5)

2. O(n) should be at most n times as large as O(1):

       O(1) ≤ O(n) ≤ nO(1).                                              (3.6)

(a) Since U(n) = R(n)E(n), we have E(n) ≤ U(n) by Eq. 3.6. From Eq. 3.5, we have
    U(n) = O(n)/(nT(n)) ≤ 1, and hence R(n) ≤ 1/E(n). The proof is completed by
    combining the inequalities.

(b) The result is obtained by combining Eqs. 3.6, 3.9, and 3.10.

(c)

       Q(n) = S(n)E(n)/R(n)
            = [T(1)/T(n)] x [T(1)/(nT(n))] x [O(1)/O(n)]
            = T^3(1) / (n T^2(n) O(n))
            ≤ T(1)/T(n) = S(n)          (from Eq. 3.8).

(d) The following inequalities can easily be shown to hold for n > 1:

       1/n ≤ (n + 3)/(4n) ≤ (n + 3)(n + log2 n)/(4n^2) ≤ 1,

       1 ≤ (n + log2 n)/n ≤ 4n/(n + 3).
    R_a = 0.8 x 10 + 0.2 R_2 = 8 + 0.2 R_2 Mflops.
    R_g = 10^0.8 x R_2^0.2 Mflops.
    R_h = 1/(0.8/10 + 0.2/R_2) = 10 R_2 / (0.8 R_2 + 2) Mflops.

    R_a = 0.2 x 10 + 0.8 R_2 = 2 + 0.8 R_2 Mflops.
    R_g = 10^0.2 x R_2^0.8 Mflops.
    R_h = 1/(0.2/10 + 0.8/R_2) = 10 R_2 / (0.2 R_2 + 8) Mflops.

    [Figure: the three mean rates plotted as functions of R_2.]
(d) Suppose the harmonic mean MIPS rate is used as the criterion to compare the
    relative performance of the three machines. For machine 1, the value is

        R_h^(1) = 1 / (f_1/100 + f_2/0.1) = 100 / (1000 - 999 f_1).

    Similar expressions can be written for machines 2 and 3. We can plot R_h^(1),
    R_h^(2), and R_h^(3) as functions of f_1. The following diagram shows the variation
    of the harmonic mean MIPS rate for the three machines with respect to f_1.

    [Figure: harmonic mean MIPS rate of the three machines versus f_1.]

    It is seen that R_h^(1) is very sensitive to the value of f_1, R_h^(2) varies slowly
    with f_1, and R_h^(3) remains constant, independent of the value of f_1. For most
    values of f_1, R_h^(3) has a larger value than R_h^(1) and R_h^(2). But when f_1 is
    close to 1, R_h^(1) and R_h^(2) surpass R_h^(3). A large value of f_1 means that most
    of the time is spent on the high-MIPS benchmark for machines 1 and 2, leading to
    a high harmonic mean MIPS rate for the two machines.
Problem 3.14  The communication cost of a data exchange between two directly
connected nodes is modeled by α + βm, where α is the time required to set up a
channel, β is the time to transmit one word over the communication channel, and m is
the amount of data exchanged.

(a) In the Fox-Otto-Hey algorithm, each processor in a √n x √n torus is assigned s^2/n
    elements of matrices A and B. The algorithm requires a total of √n iterations.
    During each iteration, a subblock of A is broadcast along the row of the torus.
    Therefore, in each iteration, the time taken for communication is √n(α + s^2 β/n).
    Since √n iterations are required, the total time for communication is (nα + s^2 β).
    However, if the matrix subblocks are sent in a pipelined fashion (such as wormhole
    routing), the second term is reduced by a factor of 2/√n (for details, please see
    [Fox87]), resulting in (nα + 2s^2 β/√n) for the communication overhead.

    If the torus is embedded in a hypercube and a sophisticated one-to-all broad-
    cast scheme such as the one in [Johnsson89] is used, the communication time per
    iteration can be further reduced to (log n)α + 2s^2 β/√n. Therefore the total
    communication overhead on n processors is √n (log n)α + 2s^2 β.
(b) Berntsen’s algorithm is designed to take advantage of the higher connectivity in
hypercube computers. Matrices A and B are partitioned into 2 strips by column
and by row, respectively, as follows:
Ba
Ar | Az} As] Aad - c
‘The product matrix C is computed as
a
C= AB
The hypercube is divided into 2* subcubes, each comprising 2°* nodes. The first
step of the algorithm involves the computation of O; = A;Bj, which is carried out
in each subcube using Cannon’s algorithm (Cannon69]. See solution of Problem
8.12 for an example. The communication overhead in this step is'
T= 22042").
”
In the second step, C; in the subcubes are summed together using a “cascade
sum" algorithm (Berntsen90}. This step requires communications among subcubes
with an overhead 7
Tokar ep,
‘The total communication overhead is n(T; +73). Using the relation
2* = n¥/3, the complexity of the communication overhead is proved.
(c) In the Dekel-Nassimi-Sahni algorithm, multiplication is performed on a hypercube
    with 2^{3q} = s^3 nodes. Each node r in the hypercube is identified by a 3-tuple
    (i, j, k), 0 ≤ i, j, k ≤ 2^q - 1 = s - 1. At the beginning, elements A_{jk} and B_{jk}
    are stored in node (0, j, k). At the end, C_{jk} is also stored in node (0, j, k).

    The algorithm consists of three phases. In the first phase, A_{jk} and B_{jk} are
    replicated on nodes (i, j, k), 1 ≤ i ≤ s - 1.

    Here the workload parameter for R_i is omitted since it is assumed to be independent
    of the workload. Based on the definition of isoefficiency, the isospeed condition is
    obtained.

Chapter 4
Processors and Memory Hierarchy
Problem 4.1

(a) The processor design space is a coordinate space with the x and y axes representing
    clock rate and CPI, respectively. Each point in the space corresponds to a de-
    sign choice of a processor whose performance is determined by the values of the
    coordinates.

(b) The time required between issuing two consecutive instructions.

(c) The number of instructions issued per cycle.

(d) The number of cycles required for the execution of a simple instruction, such as
    add, move, etc.

(e) Two or more instructions attempt to use the same functional unit at the same
    time.

(f) A coprocessor is usually attached to a processor and performs special functions at
    high speed. Examples are floating-point and graphics coprocessors.

(g) Registers which are not designated for special usage, as opposed to special-purpose
    registers such as base registers or index registers.

(h) The addressing mode specifies how the effective address of an operand is generated
    so that its actual value can be fetched from the correct memory location.

(i) In the case of a unified cache, both data and instructions are kept in the same
    cache. In split caches, data and instructions are held in separate caches.

(j) Hardwired control: control signals for each instruction are generated by proper
    circuitry such as delay elements. Microcoded control: each instruction is imple-
    mented by a set of microinstructions which are stored in a control memory. The
    decoding of microinstructions generates appropriate signals to control the execu-
    tion of an instruction.
Problem 4.2

(a) Virtual address space is the memory space required by a process during its execu-
    tion to accommodate the variables, buffers, etc., used in the computations.

(b) Physical address space is the set of addresses assigned to the physically available
    memory words.

(c) Address mapping is the process of translating a virtual address to a physical ad-
    dress.

(d) The entirety of a cache is divided into fixed-size entities called blocks. A block is
    the unit of data transfer between main memory and cache.

(e) Multiple levels of page tables are used to translate a virtual page number into a
    page frame number. In this case, some tables actually store pointers to other tables,
    similar to indirect addressing. The objective is to deal with a large memory space
    and facilitate protection.

(f) The hit ratio at level i of the memory hierarchy is the probability that a data item
    is found in M_i.

(g) A page fault is the situation in which a demanded page cannot be found in the
    main memory and has to be brought in from the disk.

(h) A hash function maps an element in a large set to an index in a small set. Usually
    it treats the input element as a number or a sequence of numbers and performs
    arithmetic operations on it to generate the index. A suitable hash function should
    map the input set uniformly onto the output set.

(i) An inverted page table contains entries that record the virtual page number asso-
    ciated with each page frame that has been allocated. This is in contrast to a
    direct-mapping page table.

(j) The strategies used to select the page or pages resident in main memory to be
    replaced when such a need arises.
Problem 4.3

(a) A windowing system divides the register file on a machine into groups which are
    assigned to different procedures. There is usually overlap among the register sets
    to provide a fast communication mechanism among cooperating procedures for
    parameter passing and to allow fast context switching. The use of a large number
    of GPRs allows less frequent memory accesses and speeds up program execution.

(b) A large register file and a large data cache both serve the purpose of reducing
    memory traffic. From an implementation point of view, the same chip area can be
    used for either a large register file or a large data cache. From a programming point
    of view, registers can be manipulated by program code, but cache is transparent
    to the user. In fact, the data cache is primarily involved in load/store operations.
    The addressing of a cache involves address translation and is more complicated
    than that of a register file.

    Reservation stations and reorder buffers are used in superscalar machines to
    facilitate instruction lookahead and internal data forwarding, which are needed to
    schedule multiple instructions through multiple pipelines simultaneously.

(c) In most RISC processors, the integer unit executes load, store, integer, bit, and
    control transfer instructions. It also fetches instructions for the floating-point unit
    in some systems. The floating-point unit performs various arithmetic operations on
    floating-point numbers. The two units can operate concurrently.
Problem 4.4

(a) The comparison is tabulated below:

        Item          | CISC                    | RISC
        --------------+-------------------------+--------------------------------
        Instruction   | 16-64 bits per          | fixed (32-bit) format
        format        | instruction             |
        Addressing    | 12-24                   | limited to 3-5 (mostly
        modes         |                         | register-based, except
                      |                         | load/store)
        CPI           | 2-15, on the average 5  | < 1.5, very close to 1

(b) • Advantages of separate caches:

      1. Double the bandwidth, because two complementary requests can be ser-
         viced at the same time.
      2. Simplify the logic design, as arbitration between instruction and data ac-
         cesses to the cache is simplified or eliminated.
      3. Access time is reduced because data and instructions can be placed close
         to the functional units which will access them. For instance, the instruction
         cache can be placed close to the instruction fetch and decode units.

    • Disadvantages of separate caches:

      1. Complicate the problem of consistency, because data and instructions
         may coexist in the same cache block. This is true if self-modifying code
         is allowed or when data and instructions are intermixed and stored in
         the same cache block. To avoid this would require compiler support to
         ensure that instructions and data are stored in different cache blocks.
      2. May lead to inefficient use of cache memory, because the working-set
         size of a program varies with time and the fraction devoted to data
         and instructions also varies. Hence, the sum of the data cache size and
         instruction cache size is usually larger than the size of a unified cache.
         As a result, the utilization of the instruction cache and/or data cache is
         likely to be lower.

    For separate caches, dedicated data paths are required for both instruction
    and data caches. Separate MMUs and TLBs are also desirable for separate
    caches to shorten the time of address translation. A higher memory band-
    width should be used for separate caches to support the increased demand.
    In actual implementation, there is a tradeoff between the degree of support
    provided and the resulting hardware complexity.
(c) • Instruction issue: a scalar RISC processor issues one instruction per cycle; a
      superscalar RISC can usually issue more than one per cycle.

    • Pipeline architecture: in an m-issue superscalar processor, up to m pipelines
      may be active in any base cycle. A scalar processor is equivalent to a
      superscalar processor with m = 1.

    • Processor performance: an m-issue superscalar processor can have a performance
      m times that of a scalar processor, provided both are driven by the same clock
      rate and no dependence relations or resource conflicts exist among instructions.

(d) Both superscalar and VLIW architectures employ multiple functional units to al-
    low concurrent instruction execution. Superscalar requires more sophisticated
    hardware support, such as large reorder buffers and reservation stations, in order
    to make efficient use of the system resources. Software support is needed to resolve
    data dependences and improve efficiency.

    In VLIW, instructions are compacted by the compiler, which explicitly packs to-
    gether instructions that can be executed concurrently based on heuristics or
    run-time statistics. Because of the explicit specification of parallelism, the hard-
    ware and software support at run time is usually simplified. For instance, the
    decoding logic can be simple.
Problem 4.5  Only a single pipeline is active at a time in a scalar CISC or RISC
architecture, exploiting parallelism at the microinstruction level. The operational
requirement is simple. In a superscalar RISC, multiple pipelines can be active
simultaneously. To do so requires extensive hardware and software support to effectively
exploit instruction parallelism. In a VLIW architecture, multiple pipelines can also be
active at the same time. Sophisticated compilers are needed to compact irregular code
into a long instruction word for concurrent execution.
Problem 4.6

(a) The i486 is a CISC processor. The following diagram shows the general instruction
    format. A few variations also exist for some instructions.

    [Figure: general i486 instruction format — optional prefixes, opcode (one or two
    bytes), mod r/m and s-i-b addressing-mode bytes, displacement (none, 8, 16, or
    32 bits), and immediate (none, 8, 16, or 32 bits).]

    Data formats:
    • Byte (8 bits): 0 to 255
    • Word (16 bits): 0 to 64K
    • DWord (32 bits): 0 to 4G
    • 8-bit integer: ±10^2
    • 16-bit integer: ±10^4
    • 32-bit integer: ±10^9
    • 64-bit integer: ±10^18
    • 8-bit unpacked BCD (1 digit): 0-9
    • 8-bit packed BCD (2 digits): 0-99
    • 80-bit packed BCD (18 digits): ±10^18
    • Single-precision real (24-bit significand): ±10^38
    • Double-precision real (53-bit significand): ±10^308
    • Extended-precision real (64-bit significand): ±10^4932
    • Byte string, word string, dword string, and bit string to support ASCII and
      other data types.
(b) There are 12 different modes whereby the effective address (EA) can be gener-
    ated:

    • register mode
    • immediate mode
    • direct mode: EA ← displacement
    • register indirect or based: EA ← (base register)
    • based with displacement: EA ← (base register) + displacement
    • indexed with displacement: EA ← (index register) + displacement
    • scaled index with displacement: EA ← (index register) x scale + displacement
    • based index: EA ← (base register) + (index register)
    • based scaled index: EA ← (base register) + (index register) x scale
    • based index with displacement: EA ← (base register) + (index register) +
      displacement
    • based scaled index with displacement: EA ← (base register) + (index register)
      x scale + displacement
    • relative: new PC ← PC + displacement (used in conditional jumps, loops,
      and call instructions)
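As a small illustration of the scaled-index modes (the function and numbers below are
hypothetical, not from the i486 manual), the effective address is assembled from base,
index, scale, and displacement:

    def effective_address(base=0, index=0, scale=1, displacement=0):
        """EA = (base register) + (index register) * scale + displacement."""
        assert scale in (1, 2, 4, 8)           # the scale factors the i486 allows
        return base + index * scale + displacement

    # based scaled index with displacement, e.g. element i of an array of 4-byte items:
    print(hex(effective_address(base=0x1000, index=5, scale=4, displacement=0x20)))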
(c) Instruction categories:

    • data transfer: MOV dst, src
    • arithmetic: ADD dst, src
    • logic, shift, and rotate:
          AND dst, src
          SHL dst, count
          ROL dst, count
    • string comparison: CMPS sdst, ssrc
    • bit manipulation: BT dst, bit
    • control transfer: JMP addr
    • high-level language support: LEAVE (procedure exit)
    • protection support: LSL dst, src (load segment limit)
    • floating-point operation: FADD src
    • floating-point control: FINIT (initialize FPU)
(d) HLL support instructions:
BOUND reg, addr    ; check that (reg) lies within the bounds stored at addr
ENTER imm16, imm8  ; make a stack frame of imm16 bytes at nesting level imm8
LEAVE              ; procedure exit
SETcc byte         ; set byte on condition cc, reset byte to 0 otherwise
Assembly directives: commands to the assembler, not executable. For instance,
the following directives define a data segment:
DATA1 SEGMENT
DATA1 ENDS
(e) Interrupt, debugging, and testing features:
• Interrupt: the i486 can handle up to 256 different interrupts, 32 of which are
reserved for Intel; the others can be defined by users. The starting address
of an interrupt service routine is called an interrupt vector. The interrupt
vectors are stored in an interrupt vector table (IVT). When an interrupt
occurs, relevant register values are pushed onto a stack. The interrupt num-
ber is used by the CPU to retrieve the corresponding interrupt vector from the
IVT. After the interrupt service routine is executed, the program resumes
execution of the interrupted instruction.
• Two types of test are available:
1. built-in self-test: tests nonrandom logic realized by PLAs, control ROM,
the TLB, and the on-chip cache
2. external tests: can be performed on the TLB and on-chip cache.
• Three types of on-chip debugging aids are provided:
1. code execution breakpoint opcode: can be inserted at any desired break-
point.
2. single-step capability
3. code and data breakpoint capability provided by the debug registers.
(f) The 80486 allows the execution of 8086 application programs in two modes:
• Real mode: This has the same base architecture as the 8086, but the program
is allowed to access the 32-bit register set of the 80486. The default operand
size is 16 bits. Paging is not allowed in this mode. The maximum memory
size is limited to 2^20 = 1 Mbyte.
• Virtual mode: This mode allows 8086 application programs to take full
advantage of the facilities provided by the i486. In this mode, the i486 can execute any 8086, 80286,
and 80386 software. Also, paging allows more flexible address mapping. The
linear address space available is 2^32 = 4 Gbytes and the virtual address space
is 2^46 bytes.
(g) By setting the PG bit (bit 31) of control register CR0 to 0, paging is disabled. This can
be controlled by software. When paging is disabled, the linear address generated
by the segmentation mechanism is the same as the physical memory address and can
be used directly to access data from memory. Paging is used to cope with the
external fragmentation problem, but it also slows down the system. Applications
which have a stable memory requirement throughout their execution may use this
feature (paging disabled) to improve efficiency.
(h) By selecting a segment size of 4 Gbytes, the entire linear address space becomes
a single segment, which essentially disables the segmentation mechanism. In this
case, segment offset, linear address, and physical address are all identical. Segmen-
tation provides a logical view of the memory space and facilitates protection and
sharing of data. If the system is dedicated to a single application program which
requires a huge memory space, segmentation can be disabled.
(i) Four levels of protection, called privilege levels, are provided:
level 0 (PL = 0): kernel            (most privileged)
level 1 (PL = 1): system services
level 2 (PL = 2): OS extensions
level 3 (PL = 3): applications      (least privileged)
Data stored in a segment with PL = p can be accessed only by code with PL <= p.
A code segment with PL = p can be called only by a task executing with PL <= p.
(j) The Intel i586 has been renamed the Pentium. No detailed information on the processor
is available yet.
Problem 4.7
(a) (1) The general format of an instruction in the i860 is shown below:
[Diagram: 32-bit instruction word divided into an opcode field, two source register fields, a destination register field, and an offset/immediate field.]
There are several variants of this format. Floating-point instructions
also have a similar format, but provide bit fields for specifying the precision of
the operands, pipelining mode, and dual-instruction mode.
Data formats supported include
• Load/store references support 8-, 16-, 32-, 64-, and 128-bit operands.
• Integer operations are performed on 32-bit operands.
• Integer arithmetic operations support 8- and 16-bit operands by sign-
extending the operands to 32 bits.
• Floating-point numbers follow the IEEE 754 standard (see Chapter 6).
• Graphics pixels of 8, 16, and 32 bits are supported. However, re-
gardless of pixel size, the i860 always operates on 64 bits of pixels at a
time.
(2) Four basic addressing modes are supported:
• Offset: absolute address into the first or last 32 Kbytes of the logical
address space.
• Register: operand in a CPU register.
• Register indirect + offset: EA = const + (reg).
• Register indirect + index: EA = (reg1) + (reg2).
(3) Instruction categories:
• Load/store instructions:
ld.x     ; load integer
• Register-to-register move instructions:
ixfr     ; transfer integer to F-P register
• Integer arithmetic instructions:
addu     ; add unsigned
• Shift instructions:
shl      ; shift left
• Logical instructions:
andnot   ; logical AND NOT
• Control-transfer instructions:
intovr   ; software trap on integer overflow
• System control instructions:
flush    ; cache flush
(b) The i860XP provides hardware snooping for cache coherence, whereas in the previous genera-
tion, the i860XR, multiprocessor cache consistency requires software support to avoid caching
shared writable data.
(c) Dual-operation mode refers to the simultaneous execution, under the supervision
of the floating-point control unit, of floating-point operations in the adder and mul-
tiplier. Such operations can be specified by dual-operation instructions such as
Subtract-and-Multiply or Add-and-Multiply.
Dual-instruction mode refers to the capability of the integer unit and floating-
point unit to execute instructions in parallel. Programmers can specify dual-
instruction mode by using assembler directives or by explicitly modifying the op-
code mnemonics.
(d) The i860 has a virtual address space of 2^32 bytes. Translation of virtual address to
physical address is optional and is in effect only when the ATE (Address Translation
Enable) bit in the directory base register is set to 1 by the operating system. The
format of a virtual address is as follows:
    | Dir (bits 31-22) | Page (bits 21-12) | Offset (bits 11-0) |
The address translation mechanism uses the Dir field as an index into a page
directory, which is pointed to by the DTB (directory table base) field of the di-
rectory base register. The Page field is used as an index into the page table
determined by the page directory. The Offset field is used to select a byte within
the page determined by the page table. Each page has a size of 4 Kbytes. This
address translation is illustrated in the following diagram:
[Diagram: two-level address translation through the page directory and a page table, producing the physical page frame plus the Offset.]
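A minimal Python sketch of this two-level walk is given below; the dictionary-based page directory and page table are hypothetical stand-ins for the in-memory structures walked by the hardware, and the 10/10/12-bit field split follows the format above.

    # Sketch of the two-level translation with 4-Kbyte pages; the dictionaries
    # stand in for the page directory and page tables that reside in memory.

    PAGE_SHIFT, PT_BITS, DIR_BITS = 12, 10, 10

    def translate(vaddr, page_directory):
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        page = (vaddr >> PAGE_SHIFT) & ((1 << PT_BITS) - 1)
        dir_index = (vaddr >> (PAGE_SHIFT + PT_BITS)) & ((1 << DIR_BITS) - 1)
        page_table = page_directory[dir_index]   # first-level lookup
        frame_base = page_table[page]            # second-level lookup
        return frame_base + offset

    # Hypothetical mapping: virtual page (Dir=1, Page=2) -> physical frame 0x5000.
    pd = {1: {2: 0x5000}}
    print(hex(translate((1 << 22) | (2 << 12) | 0x1A4, pd)))   # 0x51a4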
Problem 4.8
(a) The allocation of registers is shown in the left-hand side of the following diagram
when the total number of registers N is 40.
(b) If N = 72, the registers can be organized into 4 windows as shown in the right-hand
side of the diagram. Note that in both figures, the eight globally shared registers
are not shown.
(c) The scalability of the SPARC architecture refers to the fact that the number of register windows
can vary with different SPARC implementations.
(d) A calling procedure can pass parameters to a subroutine by writing them into its
OUT registers, which overlap with the IN registers of the subroutine. Likewise, the
results obtained by the subroutine can be passed back by leaving them in its IN
registers, which are the OUT registers of the calling procedure.
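The window counts used in parts (a) and (b) follow directly from the register budget; the small Python helper below (the function name and the assumption of 8 globals plus 16 unique registers per window are ours, matching the figures above) makes the arithmetic explicit.

    # Number of overlapping register windows that fit in N registers: 8 global
    # registers are shared, and each window adds 16 unique registers (8 locals
    # plus 8 that overlap the adjacent window).

    def num_windows(total_registers, num_globals=8, unique_per_window=16):
        return (total_registers - num_globals) // unique_per_window

    print(num_windows(40))    # 2 windows, as in part (a)
    print(num_windows(72))    # 4 windows, as in part (b)
    print(num_windows(136))   # 8 windows, a larger implementation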
Problem 4.9
(a) Two situations may cause pipelines to be underutilized: (i) the instruction latency
is longer than one base cycle, and (ii) the combined cycle time is greater than the
base cycle.
(b) Dependences among instructions or resource conflicts among instructions can pre-
vent simultaneous execution of instructions.
Problem 4.10
(a) Vector instructions perform identical operations on vectors whose length is usually much
larger than 1. Scalar instructions operate on one number or one pair of numbers at a
time.
(b) Suppose the pipeline is composed of k stages and the vector is of length N. The first
output is generated in the k-th cycle. Afterward, an additional output is generated
in each cycle. The last result comes out of the pipeline in cycle (N + k - 1). Using
a base scalar machine, it takes Nk cycles. Thus the speedup is Nk/(N + k - 1).
(c) If m-issue vector processing is employed, each vector is of length N/m. Therefore,
the execution time is (N/m + k - 1) cycles. If only parallel issue is used, the
execution time is (N/m)k cycles. Thus, the speed improvement is
    (N/m)k / (N/m + k - 1) = Nk / (N + m(k - 1)).
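The two expressions are easy to evaluate numerically; the sample values of N, k, and m below are illustrative only and are not taken from the problem statement.

    # Evaluate the two speedup expressions from parts (b) and (c).

    def pipeline_speedup(N, k):
        return N * k / (N + k - 1)

    def m_issue_improvement(N, k, m):
        return (N / m) * k / (N / m + k - 1)

    N, k, m = 64, 6, 4
    print(round(pipeline_speedup(N, k), 2))        # ~5.57
    print(round(m_issue_improvement(N, k, m), 2))  # ~4.57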
Problem 4.11
(a) The average cost is
    c = (c1 s1 + c2 s2) / (s1 + s2).
For c to approach c2, the conditions are s2 >> s1 and c2 s2 >> c1 s1.
(b) The effective access time is
    ta = f1 t1 + f2 t2 = h t1 + (1 - h) t2.
(c) If t2 = r t1, then ta = (h + (1 - h)r) t1 and
    E = t1/ta = 1/(h + (1 - h)r).
(d) The plots of E versus h for several values of r are shown in the following diagram:
[Plot: access efficiency E versus hit ratio h for several values of r.]
(e) If r = 100, we require E = 1/(h + (1 - h) x 100) >= 0.95. Solving the inequality, we
obtain the condition
    h >= 1880/1881 = 99.95%.
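The relation between hit ratio and access efficiency can be explored with a short Python sketch; the function names are ours, and the formulas are exactly those of parts (c) and (e).

    # Access efficiency E = 1/(h + (1 - h) r) and the hit ratio needed to keep
    # E at or above a target when the lower level is r times slower.

    def efficiency(h, r):
        return 1.0 / (h + (1.0 - h) * r)

    def required_hit_ratio(target, r):
        # solve 1/(h + (1 - h) r) >= target for h
        return (r - 1.0 / target) / (r - 1.0)

    print(round(required_hit_ratio(0.95, 100), 5))   # 0.99947, i.e. about 99.95%
    print(round(efficiency(0.99947, 100), 3))        # ~0.95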
Problem 4.12
(a) The average access time is
    ta = h t1 + (1 - h) t2 = h t1 + 10(1 - h) t1 = (10 - 9h) t1.
If h = 0.7, then ta = 3.7 t1 = 74 ns. If h = 0.9, then ta = 1.9 t1 = 38 ns. If h = 0.98,
then ta = 1.18 t1 = 23.6 ns.
(b) The average byte cost is
    c = (c1 s1 + c2 s2)/(s1 + s2) = (20 c2 s1 + c2 x 4000)/(s1 + 4000)
      = (20 x 0.2 s1 + 0.2 x 4000)/(s1 + 4000) = (4 s1 + 800)/(s1 + 4000).
For s1 = 64, 128, and 256, the average cost is 0.26, 0.32, and 0.43, respectively.
(c) For the three design choices, the product of average access time and average cost is
19.24, 12.16, and 10.15, respectively. Therefore, the third option is the best choice.
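The comparison can be reproduced with a short Python sketch; the value t1 = 20 ns and the cost figures are those implied by the computations above.

    # Access time, average per-byte cost, and their product for the three sizes.

    t1 = 20.0   # ns

    def access_time(h):
        return (10 - 9 * h) * t1

    def avg_cost(s1, c1=4.0, c2=0.2, s2=4000.0):
        return (c1 * s1 + c2 * s2) / (s1 + s2)

    for s1, h in [(64, 0.7), (128, 0.9), (256, 0.98)]:
        ta, c = access_time(h), round(avg_cost(s1), 2)
        print(s1, ta, c, round(ta * c, 2))
    # 64: 74.0 x 0.26 = 19.24; 128: 38.0 x 0.32 = 12.16; 256: 23.6 x 0.43 = 10.15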
Problem 4.13 In a system with private virtual memory, processors communicate
with each other through message passing. The latency depends on the interconnection
topology and channel bandwidth. As the system grows larger, latency becomes longer.
There is no data sharing among the processors, so data coherence is not a
problem. Data may migrate from one node to another, but once a datum reaches a
destination node, it becomes private data of that node.
In implementation, message passing is facilitated by a pair of commands (send
and receive) or through remote procedure calls. Since message passing essentially involves
I/O operations, it is much more expensive than local memory accesses. As a result,
applications which can be partitioned into tasks requiring little interaction with each
other are suitable for implementation on such machines.
In a system with a globally shared virtual memory space, the private memory associ-
ated with individual processors forms a uniform address space visible to all processors.
Data can be shared as in a shared-memory multiprocessor. Access latency may vary,
depending on the physical memory location of the data. Some systems allow replication
of data to reduce the latency. Data can be migrated from one processor to another in
pages or other logical units upon demand. If replicated data can be written by multiple
processors, data coherence becomes an issue which needs to be addressed.
Actual implementations differ. Some systems allow only read-only data to be duplicated
[Li89]. Others allow replication of writable data as well. In either case, mechanisms
must be provided to track the location of each data unit (page or object) and enable
fast transportation of data. The complexity can grow rapidly with the system size. In
spite of a globally shared address space, access time still varies with the actual location
of the data. Therefore, applications with good spatial and temporal localities are
suitable candidates. Applications with less regular communication patterns can also be
implemented, although the performance is likely to be degraded. In general, since read
operations are much cheaper than write operations, applications with high read/write
ratios are particularly suited.
Problem 4.14
(a) The inclusion property refers to the property that information present in a lower-level
memory must be a subset of that in a higher-level memory.
(b) The coherence property requires that copies of an information item be identical through-
out the memory hierarchy.
(c) The write-through policy requires that changes made to a data item in a lower-level
memory be made in the next higher-level memory immediately.
(d) The write-back policy postpones the update of the level (i + 1) memory until the item is
replaced or removed from the level i memory.
(e) Paging divides the virtual memory and physical memory into pages of fixed size to
simplify memory management and alleviate the fragmentation problem.
(f) Segmentation divides the virtual address space into variable-sized segments. Each
segment corresponds to a logical unit. The main purpose of segmentation is to
facilitate sharing and protection of information among programs.
Problem 4.15
(a) LRU page replacement gives the following result, with * at the bottom of a column
indicating a page fault:
[Table: contents of the four page frames after each reference in the trace under LRU; the top row repeats the reference stream, and an asterisk below a column marks a page fault.]
In each cycle, the most recently referenced page is brought to the top page
frame. As a result, the top row traces out the original page reference stream. The
hit ratio using LRU is 16/23.
(b) In the circular FIFO scheme, the page frames are organized as a circular queue Q. The
page frames are referenced as Q(i), 0 <= i <= 3. A pointer P is used in conjunction
with the usage bits U(i), 0 <= i <= 3, to decide which page is to be replaced in case
of a page fault. Initially the pointer points to the first free page frame (P = 0)
and the usage bits are all set to 0 (U(0 : 3) = 0). The behavior rules are specified
below:
• Page fault on page J:
    Q(P) = J;  U(P) = 1;
    P = (P + 1) mod 4;
• Page hit on a page resident in page frame I:
    if (P = I) then P = (P + 1) mod 4;
The update of the pointer in the event of a page hit is to avoid replacing the page
immediately in an ensuing page fault.
Based on the initial conditions and behavior rules, it is easy to write a program
to trace the contents of the page frames in response to the reference stream; a sketch is given after the table below. In the
following table, we show the evolution of the arrays and the pointer. An asterisk
(*) at the end of a row indicates that a page fault has occurred for that particular
page reference.
[Table: for each page reference, the contents of frames Q(0)-Q(3), the usage bits U(0)-U(3), and the pointer P; rows marked * correspond to page faults.]
For this particular reference stream, the hit ratio is 16/33, which is the
same as that for the LRU scheme. However, the contents of the page frames are
somewhat different for the two schemes.
Note that different behavior rules have been proposed in the literature, which may
give rise to slightly different results.
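As noted above, the behavior rules are easy to turn into a trace program. The sketch below follows the rules as stated (including the pointer update on a hit); the short reference stream at the bottom is only a stand-in for the trace given in the problem statement.

    # Trace the circular-FIFO rules stated above: four frames, a pointer P,
    # and usage bits U.

    def circular_fifo(trace, nframes=4):
        Q = [None] * nframes     # page frames
        U = [0] * nframes        # usage bits
        P = 0                    # replacement pointer
        hits = 0
        for page in trace:
            if page in Q:                        # page hit
                hits += 1
                i = Q.index(page)
                U[i] = 1
                if P == i:                       # avoid replacing the page just hit
                    P = (P + 1) % nframes
            else:                                # page fault
                Q[P], U[P] = page, 1
                P = (P + 1) % nframes
            print(page, Q, U, P)
        return hits / len(trace)

    print(circular_fifo([1, 0, 2, 2, 1, 7, 6, 7, 0, 1, 2, 0, 3]))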
Problem 4.16
(a) Temporal locality refers to the property that recently used data or instructions are
likely to be reused in the near future. Spatial locality refers to the property that a
process tends to access data or instructions stored in consecutive locations. Sequen-
tial locality refers to the observation that the execution order of instructions
tends to follow the sequential program order.
(b) The working set is the subset of addresses or pages referenced within a given time
window or a given number of most recent references. It approximates the program
locality property. Pages in the working set are considered actively used and should
reside in main memory. If the window size is large, the resident pages may encom-
pass several locality regions and the size of the working set is likely to grow. A large
window size should improve the hit ratio. But in a multiprogrammed environment,
keeping a large number of pages in memory for each process will exhaust the
page frames and cause thrashing. On the other hand, a small window size gives
rise to a small working set and may lower the hit ratio because the actively used pages
shift with time.
(c) 90-10 rule: It has been empirically observed that 90% of the execution time is
spent on approximately 10% of the code. The rule reflects program locality, as a
short segment of code gets executed repeatedly and the data accessed tend to be
contiguous elements of a large array structure.
Problem 4.17
(a) The effective access time is
    teff = h1 t1 + (1 - h1) t2 = 0.95 t1 + 0.05 t2.
(b) The total cost is c = c1 s1 + c2 s2.
(c) 1. We have the following inequality:
    0.01 x 512 x 1024 + 0.0005 x s2 <= 15,000.
Therefore s2 cannot exceed 18.6 Mbytes.
2. The following inequality is obtained:
    20 x 0.95 + 0.05 x t2 <= 40.
Hence, t2 <= 420 ns.
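The two constraints in part (c) can be checked with a few lines of Python; the budget, cost, and timing figures are those used in the inequalities above.

    # Check the design constraints: memory size within the budget and
    # effective access time within the 40-ns target (t1 = 20 ns, h1 = 0.95).

    def max_s2(budget=15_000, c1=0.01, s1=512 * 1024, c2=0.0005):
        return (budget - c1 * s1) / c2          # largest affordable s2, in bytes

    def max_t2(target=40.0, t1=20.0, h1=0.95):
        return (target - h1 * t1) / (1 - h1)    # slowest allowable t2, in ns

    print(round(max_s2() / 2**20, 1))   # ~18.6 Mbytes
    print(max_t2())                     # 420.0 ns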
Problem 4.18
Attribute: Data objects
  Symbolic processing: lists, relational databases, scripts, semantic nets, frames,
  blackboards, objects, production systems.
  Numeric processing: integer and floating-point numbers, vectors, matrices.
Attribute: Common operations
  Symbolic processing: search, sort, pattern matching, filtering, contexts, partitions,
  transitive closures, unification, text retrieval, set operations, reasoning.
  Numeric processing: add, subtract, multiply, divide, matrix multiplication,
  matrix-vector multiplication, reduction operations such as the dot product of vectors, etc.
Attribute: Memory requirements
  Symbolic processing: large memory with an intensive access pattern; addressing is
  often content-based; locality of reference may not hold.
  Numeric processing: large memory demand with intensive access; the access pattern
  usually exhibits a high degree of spatial and temporal locality.
Attribute: Communication patterns
  Symbolic processing: message traffic varies in size and destination; the granularity
  and format of message units change with the application.
  Numeric processing: message traffic and granularity are relatively uniform; proper
  mapping can restrict communication largely to neighboring processors.
Attribute: Algorithm properties
  Symbolic processing: nondeterministic, possibly parallel and distributed
  computations; data dependences may be global and irregular in pattern and granularity.
  Numeric processing: typically deterministic; amenable to parallel and distributed
  computations; data dependence is mostly local and regular.
Attribute: Input/output requirements
  Symbolic processing: input can be graphical and audio as well as from the keyboard;
  access to very large online databases.
  Numeric processing: large data sets usually exceed the memory capacity; fast I/O is
  highly desirable.
Attribute: Architecture features
  Symbolic processing: parallel update of the knowledge base, dynamic load
  balancing, dynamic memory allocation, hardware-supported garbage collection,
  stack processor architecture, symbolic processors.
  Numeric processing: can be implemented with vector, MIMD, or SIMD
  processors using various memory and interconnection structures; systolic
  arrays are suitable for certain types of computations.
Chapter 5
Bus, Cache, and Shared Memory
Problem 5.1
(a) The maximum bus bandwidth is 8 x 20 = 160 Mbytes/s.
(b) Memory access time is defined as the time from when a memory request is received by the
memory unit to the time at which all the requested information has been made
available at the memory output terminals (Hayes, 1988).
At first, it takes 50 ns (1 bus cycle) for the address to be transmitted to the
memory module. After the data is ready on the memory output port, it takes 50
ns to transfer one word to the processor. In the worst case, the four words are
accessed one by one separately. The total amount of time for a processor to access
one word from the memory is
    50 ns + 100 ns + 50 ns = 200 ns,
during which the bus cannot be used by other processors. Thus, the effective
bandwidth is
    8 bytes / 200 ns = 40 Mbytes/s,
which is one-fourth of the maximum bus bandwidth.
If the memory addresses are interleaved, access of the four words can
be performed simultaneously. It takes 50 ns to transmit the address to the memory
modules and 100 ns to get the data ready in the latches. Then it takes four bus cycles
to transfer the four words to the requesting processor. Therefore, the total time
required is 50 + 100 + 4 x 50 = 350 ns. Thus, the effective bus bandwidth is
    4 x 8 bytes / 350 ns = 91.4 Mbytes/s.
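The two bandwidth figures can be reproduced with a short Python sketch; the constants follow the timing assumptions above.

    # Effective bus bandwidth for the two cases in part (b): four separate
    # single-word accesses versus one interleaved four-word access.

    BUS_CYCLE = 50e-9      # s, one bus cycle
    MEM_ACCESS = 100e-9    # s, memory access time
    WORD = 8               # bytes

    def separate():
        t = 4 * (BUS_CYCLE + MEM_ACCESS + BUS_CYCLE)   # address, access, transfer
        return 4 * WORD / t

    def interleaved():
        t = BUS_CYCLE + MEM_ACCESS + 4 * BUS_CYCLE     # one address, four transfers
        return 4 * WORD / t

    print(separate() / 1e6, interleaved() / 1e6)       # 40.0 and ~91.4 Mbytes/s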
(c) Any of the arbitration schemes discussed in the text can be used. The decision is
based on the desired performance and circuit complexity.
(d) 40 address lines and 64 data lines are needed. In order to limit the total number
of signal lines to 104, the address lines can serve as low-order data lines by use
of multiplexers. The other lines carry control signals such as bus request, bus
grant, reset, data sync, address sync, data ack, arbitration, read/write, etc. For
a description of the functionality of each signal line, consult the specification of
standard buses.
(e) At least 21 slots are needed, one for each processor board, one for each memory
board, and one for the bus controller.
Problem 5.2 In a daisy-chained arbitration scheme, there is only one central arbiter.
One bus request line is connected to all processors. A single bus grant line is connected
to the processors in a daisy-chain manner, which means that a processor will acquire the
bus only if none of the processors closer to the arbiter requests to use the bus. As a
result, the scheme works with a fixed priority based on the proximity of the processors to
the arbiter.
The advantage is its simplicity of installation; additional processors can be added
to an existing chain by sharing the same set of arbitration lines. The simplicity also
makes it feasible to install more than one set of request and grant lines to improve
system reliability.
The disadvantage is the violation of the fairness principle by the fixed priority assignment.
Also, it takes a long time for the bus-grant signal to propagate along the chain. As a
result, the number of processors that can be effectively supported is small.
If a distributed arbiter scheme is used, each processor has its own arbiter, to which a
unique arbitration number (AN) is assigned. When two or more processors request
to use the bus simultaneously, the arbiters bid by sending their ANs to a shared bus re-
quest/grant (SBRG) line whose logic selects the maximum among the ANs and leaves
it on the line. Subsequently, each arbiter compares its AN with that on the SBRG in
parallel. Only the request from the arbiter whose AN matches that on the SBRG will be
sustained. After the present transaction is finished, the selected processor will seize the
bus.
The advantage of using distributed arbiters is the flexibility of implementing various
priority schemes and the fast arbitration time.
The disadvantage lies in the complex arbitration structure, which increases the
implementation cost.
Problem 5.3
(a) Assume low-order interleaving is adopted in the organization of the memory mod-
ules. Further assume that requests to all memory modules are equally likely. There-
fore, the probability of a request by processor Pi to any memory module Mj is p/m,
independent of i and j. The probability of no request to Mj from Pi is 1 - p/m.
Hence, the probability of no request to Mj from any of the processors is (1 - p/m)^n, and
the probability of at least one request is 1 - (1 - p/m)^n. If
b >= m, all the requests can be satisfied by the bus system. Therefore, the memory
bandwidth is estimated as follows:
    BW = m (1 - (1 - p/m)^n).
When n is large,
    (1 - p/m)^n = e^(-np/m) (approximately).
The memory bandwidth is thus m(1 - e^(-np/m)).
(b) The memory bandwidth BW_b is the expected number of busy memory modules, or suc-
cessful memory accesses, in a multibus system with b buses; np is the expected
number of memory requests generated by the processors. In general, not all mem-
ory requests can be satisfied because of conflicts arising from (i) more than one
request being made to the same memory module, and (ii) the inability of the available bus
capacity to accommodate all the requests. The presence of conflicts means that not
all the expected memory access requests can be successful. Therefore, BW_b <= np.
However, as the authors showed in the paper, through proper choice of the design
parameters, most of the memory requests from the processors can be satisfied, i.e.,
BW_b/np -> 1.
Problem 5.4 For this problem, it is assumed that each cache miss (read or write)
leads to the replacement of a block, which can be occupied or empty, in the cache to
make room for the missing block.
(a) Write-through scheme:
[Access tree: cache hit (0.95) - write (0.5) 400 ns, read (0.5) 20 ns; cache miss (0.05) - write (0.5) (400 + 400) ns, read (0.5) (400 + 20) ns.]
Effective memory access time:
    t_WT = 0.95 x (0.5 x 400 + 0.5 x 20) + 0.05 x (0.5 x (400 + 400) + 0.5 x (400 + 20))
         = 0.95 x 210 + 0.05 x 610
         = 230 (ns).
(b) Write-back scheme:
[Access tree: cache hit (0.95) - read (0.5) 20 ns, write (0.5) 60 ns; cache miss (0.05) - read (0.5): clean (0.9) (400 + 20) ns, dirty (0.1) (400 + 400 + 20) ns; write (0.5): clean (0.9) (400 + 60) ns, dirty (0.1) (400 + 400 + 60) ns.]
Effective memory access time:
    t_WB = 0.95 x (0.5 x 20 + 0.5 x 60) + 0.05 x (0.5 x
           (0.9 x (400 + 20) + 0.1 x (400 + 400 + 20)) + 0.5 x
           (0.9 x (400 + 60) + 0.1 x (400 + 400 + 60)))
         = 0.95 x 40 + 0.05 x 480
         = 62 (ns).
(c) The memory access time per instruction is
    0.2 x 230 = 46 ns for write-through,
    0.2 x 62 = 12.4 ns for write-back.
Therefore, the effective execution time per instruction is
    0.1 us + 46 ns = 0.146 us for write-through,
    0.1 us + 12.4 ns = 0.1124 us for write-back.
The effective MIPS rate for each processor is
    1/0.146 = 6.85 for write-through,
    1/0.1124 = 8.90 for write-back.
The upper bound of the MIPS rate for the multiprocessor system is
    16 x 6.85 = 109.6 for write-through,
    16 x 8.90 = 142.3 for write-back.
The above upper bounds are obtained by considering only the memory access
time. In fact, it is difficult to achieve the upper bounds since the processors may not
be fully utilized due to data dependences or resource conflicts among instructions.
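The access times and MIPS rates above can be reproduced with the following Python sketch; the figure of 0.2 memory accesses per instruction is the value implied by part (c), and the remaining constants are those used in the access trees.

    HIT, MISS = 0.95, 0.05
    MEM, C_READ, C_WRITE, DIRTY = 400, 20, 60, 0.1   # ns; DIRTY = dirty-block fraction

    t_wt = (HIT * (0.5 * MEM + 0.5 * C_READ)
            + MISS * (0.5 * (MEM + MEM) + 0.5 * (MEM + C_READ)))

    t_wb = (HIT * (0.5 * C_READ + 0.5 * C_WRITE)
            + MISS * (0.5 * ((1 - DIRTY) * (MEM + C_READ) + DIRTY * (2 * MEM + C_READ))
                      + 0.5 * ((1 - DIRTY) * (MEM + C_WRITE) + DIRTY * (2 * MEM + C_WRITE))))

    for name, t in [("write-through", t_wt), ("write-back", t_wb)]:
        per_instr = 0.1 + 0.2 * t / 1000      # us: 0.1 us CPU + 0.2 accesses/instr
        print(name, t, round(1 / per_instr, 2), "MIPS per processor")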
Problem 5.5
(a) Low-order interleaving refers to an organization of the memory in which the least
significant (low-order) bits of the memory address are used to select the memory mod-
ule and the rest (the high-order bits) indicate the address of the word within the selected
module.
(b) When data blocks in a cache are tagged and indexed by physical memory address,
it is a physical address cache. In contrast, a virtual address cache does not wait for
the physical address to be generated and is accessed by virtual memory address.
It offers improved efficiency by overlapping cache access with physical address
translation. The disadvantage is the potential aliasing problem, which entails frequent
cache flushing.
(c) In a shared memory, if an update to a memory location is observed by all processors
at the same time, then the memory access is atomic. If the update is not necessarily
observed by all processors simultaneously, the memory access is nonatomic.
(d) Memory bandwidth is the maximum rate at which data can be transferred to or
from the memory. It is determined by the memory cycle time, bus width, and memory
organization. The effective data transfer rate between memory and processors may be
lower due to conflicts. Fault tolerance of the memory system is the capability to
continue operation with a lower bandwidth when one or more memory modules
fail.
Problem 5.6
(a) In a write-through cache, an update to a cache block causes the corresponding
memory block to be updated immediately. In a write-back cache, the update of
the memory block is postponed until the cache block is replaced.
(b) Data which are globally shared among several processors and whose values may be
updated can be tagged as noncacheable. Instructions, private data, and globally
shared read-only data are tagged as cacheable. This distinction is an alternative
approach used to avoid the cache inconsistency problem.
(c) Private caches are those attached to individual processors; shared caches are shared
among processors, much like shared memory modules. The two types of caches
can coexist in a system. For example, a shared cache can be used as the second-level
cache in a multilevel cache system.
(d) Cache flushing is used to deal with the aliasing problem in a virtual address cache.
Cache flushing policies determine when flushing should be performed and the level
at which flushing takes place (page, segment, context, etc.). Those policies are
closely related to operating system design.
(e) Cache hit ratio is affected by factors including cache capacity and block size. A
large cache improves the hit ratio. For a fixed cache size, there is an optimal block
size at which the hit ratio peaks. A small block size does not take full advantage of
locality properties. A large block size, on the other hand, may load unneeded data
into the cache. In a set-associative cache organization, the number of sets and the set
size can also affect the hit ratio.
Problem 5.7
(a) In order to preserve the individual program orders, the first statement executed must be
a, c, or e. Consider the case where a is executed first. A tree can be constructed,
each branch of which traces out an execution sequence that preserves the individual
program orders. The tree in the following diagram shows the interleavings of
instructions and the corresponding output for each interleaving.
[Diagram: tree of all instruction interleavings that begin with a and preserve the individual program orders, with the six-bit output produced by each interleaving.]
(b) The 720 different execution orders cannot generate all the different combinations.
For example, the combination 001100 cannot be generated by any execution order.
The 11 pair at the center of the output requires that two of the assignment state-
ments (a, c, and e) be executed before the second output statement. Hence at least
two of the three variables already have the value 1 before the last output statement is
executed, rendering it impossible to generate the last pair of 0s.
In fact, out of the 64 possible combinations, only 50 can be generated by the
six statements executed in arbitrary order. Many different execution orders generate
identical output sequences, as can be seen in (a).
(c) The sequence 011001 can only be generated by either of the following two execution
orders: cfbeda and edafbc. Note, however, that neither of the two execution orders
preserves the individual program orders. Therefore, if the individual program orders have
to be preserved, then the sequence cannot be generated. For a more formal proof,
refer to [Dubois88].
(d) Take as an example the sequence 001100, which cannot be generated if memory
accesses are atomic. Suppose each processor executes sequentially, but the changes
to variable values are not immediately observed by all the other processors. Con-
sider the order of execution abcedf, which does not violate the program order of any
individual program. First, the pair 00 is generated by b. Then d will produce the
pair 11, provided processor 2 has observed the changes made by processors 1 and
3. Finally, processor 3, which has not observed the changes to A and B by the
other processors, executes f and prints out 00.
Problem 5.8 The main memory blocks are numbered 0 to 63, and the cache block frames
are numbered 0 to 63. The mappings are shown in (a) through (d). In each case, the
address format and cache tag are also shown.
(a) Direct mapping:
(b) Fully associative mapping:
(c) Set-associative mapping:
(d) Sector mapping:
Problem 5.9
(a) Each set of the cache consists of 256/8 = 32 block frames, and the entire cache has
16 x 1024/256 = 64 sets. Similarly, the memory contains 1024 x 1024/8 = 131072
blocks. Thus, the memory address format is as shown in the following figure:
    | Cache address tag (11 bits) | Set address (6 bits) | Word address (3 bits) |
A block B of the main memory is mapped to a block frame in set F of the
cache if F = B mod 64.
(b) The effective memory access time for this memory hierarchy is 50 x 0.95 + 400 x
(1 - 0.95) = 47.5 + 20 = 67.5 ns.
Problem 5.10
(a) The address assignment is shown in the following diagram:
[Diagram: the 1024 words (10-bit memory address) are assigned to the four modules M0-M3 in low-order interleaved fashion, i.e., word i resides in module i mod 4 (0, 1, 2, 3 in the first row; 4, 5, 6, 7 in the second; ...; 1020, 1021, 1022, 1023 in the last).]
(b) There are 1024/16 = 64 blocks in the main memory, and 256/16 = 16 block
frames in the cache.
(c) 10 bits are needed to address each word in the main memory: 2 for selecting the
memory module and 8 for the offset of a word within the module. 6 bits are
required to select a word in the cache: 2 bits to select the set number and 4 bits
to select a word within a block. In addition, each block frame needs a 4-bit address
tag to identify the block resident in it.
(d) The mapping of memory blocks to the block frames in the cache is shown in the
following diagram:
[Diagram: main memory blocks, each identified by a 4-bit tag, mapped onto the sets of the cache.]
After the set into which a memory block can be mapped is identified, the
address tags of the block frames in that set are compared by associative search with
the physical memory address to determine whether the desired block is in the cache.
Problem 5.11
(a) Based on the given data, the following access tree is obtained:
[Access tree omitted.]
We have the following expression for the average access time:
    ta = fi (h c + (1 - h)(b + c)) + (1 - fi)(h c + (1 - h)((b + c)(1 - fd) + (2b + c) fd)).
(b) If the extra time taken by invalidation propagation is taken into account, the
average access time is
    ta' = ta + (1 - fi) finv tinv.
Problem 5.12
(a) The cache organization and the relation between physical address and cache address
are shown in the following diagrams:
[Diagram: physical address divided into a cache address tag, a set address, and a byte address within the block.]
(b) (0000104F)16 = (0000 0000 0000 0000 0001 0000 0100 1111)2. From the address map-
ping shown in the above diagram, it is clear that the address can be assigned to
any block frame in set 1.
(c) In order for the address (FFFF7Axy)16 to be mapped to the same set as (0000104F)16,
the least significant bit of x must be 0 and the most significant bit of y must be 1. The
other bits can be either 0 or 1. Therefore, x can be any of the hexadecimal digits
{0, 2, 4, 6, 8, A, C, E}, and y can be any element of {8, 9, A, B, C, D, E, F}.
Problem 5.13
(a) The effective CPI for each processor can be computed as
    CPI = 1 + m t r,
where m is the number of memory accesses per instruction, t the memory access time,
and r the clock rate (so that t r is the number of cycles per memory access).
Therefore, the total MIPS rate of a system with p processors is
    MIPS = p r / (1 + m t r).
(b) Using the expression derived in (a), we obtain the following equation:
    32 r / (1 + 0.4 r) = 56.
The equation is solved to give r = 35/6 = 5.83 MHz.
(c) Substituting the given performance data into the equation in (a), the following
MIPS rate is obtained:
    MIPS = 32 x 2 / (1 + 1.6 x 1 x 2) = 64/4.2 = 15.24 MIPS.
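Under the interpretation used above (r in MHz, t in microseconds), the computations in (b) and (c) can be checked with a few lines of Python; the helper name is ours.

    # System MIPS rate from part (a): MIPS = p*r / (1 + m*t*r).

    def system_mips(p, r, m, t):
        return p * r / (1 + m * t * r)

    print(round(system_mips(32, 2, 1.6, 1), 2))   # part (c): 15.24 MIPS

    # Part (b): clock rate giving 56 MIPS when p = 32 and m*t = 0.4:
    # 32r/(1 + 0.4r) = 56  =>  r = 56/(32 - 56*0.4) = 35/6 ~ 5.83 MHz
    print(round(56 / (32 - 56 * 0.4), 2))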
Problem 5.14
(a) The effective access time added by cache and memory misses for each memory access is
    ta = f1 (1 - h1) t1 + f2 (1 - h2) t2.
The average execution time per instruction is obtained by adding the memory penalty
m ta to the other per-instruction time components given in the problem, and the
effective MIPS rate of the entire system is
    MIPS = p / (average time per instruction, in us).
(b) Using the data given, we have the following values:
    ta = 0.5 x (1 - 0.95) x 0.5 + 0.5 x (1 - 0.7) x 0.5 = 0.0875,
    average time per instruction = 0.4 x 0.0875 + 0.2 + 0.05 x 5 = 0.485 (us).
Finally, p/0.485 must meet the required system MIPS rate; hence, the number of
processors needed is p = 13.
(c) The cost of the caches is 4.7 x 16 x (32 + 64) = 7219.2. Hence, the total amount
of money allowed for the shared memory is 17781.8, and the memory capacity in
Mbytes is
    Cm = 17781.8 / (cost per Mbyte of the shared memory).
Problem 5.15
(a) The address formats are shown in the following diagrams for the different design
choices:
[Diagrams: address field layouts (module number and word offset) for Designs 1, 2, and 3.]
(b) In case one memory module fails, the memory bandwidth is as follows:
• Design 1: 0.
• Design 2: 8 words per access.
• Design 3: 12 words per access.
(c) In a fault-free situation, Design 1 offers the highest memory bandwidth in the case
of vector access. But the entire memory system can be crippled by a single memory
module failure. The other two designs offer more graceful degradation in case of
module failures, although their bandwidth is not as high as that of Design 1 under fault-free
conditions.
Problem 5.16
(a) All strides except multiples of 17: 80M words per second; strides of multiples of
17: 20M words per second.
(b) All strides except multiples of 4: 80M words per second; strides of multiples of 8:
20M words per second; strides of multiples of 4 but not 8: 40M words per second.
Problem 5.17
(a) Using the formula in Eq. 5.1 of Problem 5.7, there are 20 execution interleaving
orders that preserve the individual program orders. Trees similar to that given in
Problem 5.7 can be constructed. The possible interleaving orders are: abcdef,
abdcef, abdecf, abdefc, adbcef, adbecf, adbefc, adebcf, adebfc, adefbc, dabcef,
dabecf, dabefc, daebcf, daebfc, daefbc, deabcf, deabfc, deafbc, defabc.
(b) If program order is preserved and atomic memory accesses are assumed, the fol-
lowing 4-tuple output combinations can be obtained: 0111, 1011, and 1111.
(c) Suppose program order is preserved and nonatomic memory accesses are assumed.
Then before c is executed, A has been set to 1 by a. Similarly, C is set to 1
before it is printed by f. Because of the nonatomic memory accesses, the value of D
in c and that of B in f are uncertain. Therefore the output can be 1xx1 or x11x,
depending on the instruction interleaving. Here the don't-care bit x can be either
0 or 1. The possible combinations are 1001, 1011, 1101, 1111, 0110, 0111, and 1110.
(Pattern 1111 appears in both cases and is shown only once.)
Problem 5.18
(a) Hardware complexity and implementation cost are reflected in the mechanism to de-
termine whether a given block is in the cache after the block address has been decided.
Direct mapping has the lowest cost, since a simple modulus operation is sufficient.
Fully associative mapping has the highest cost, since an associative search over all
block frames is needed. The relative cost of set-associative and sector mapping
depends on the implementation. In set-associative mapping, an associative search is
needed within each set; in sector mapping, it is needed to determine the sector.
For a fixed cache size, the size of each set or sector will make a difference in cost.
(b) In direct mapping, block replacement is rigid and trivial. The other schemes allow
similar flexibility in the design of replacement algorithms. For instance, all the
replacement algorithms discussed in the text can be implemented with any of the
three mappings. In the case of fully associative mapping, the algorithms are applied
to the entire cache. In set-associative or sector mapping, only a subset of the cache
block frames is examined in the application of the replacement algorithms.
(c) Effects of the block mapping policy on the hit ratio:
• Direct mapping: The hit ratio is strongly affected by the reference pattern. If
the reference pattern leads to a uniform distribution of the working set in the
cache, the hit ratio will be high. But if two or more blocks mapped to the same
block frame are referenced alternately, the hit ratio will drop sharply.
• Fully associative mapping: The hit ratio is essentially independent of the refer-
ence pattern. The hit ratio should be high except in the rare case of an anomalous
lack of locality in the references.
• Set-associative mapping: On the average, the hit ratio should be higher than with
direct mapping and lower than with fully associative mapping. Thrashing is still
possible, but with a lower probability than with direct mapping.
• Sector mapping: The hit ratio is sensitive to the reference pattern. Because
of the mapping scheme adopted, when a block in a sector is replaced, the
other blocks in the same sector are invalidated, which effectively reduces
the number of valid blocks resident in the cache. This is likely to have an
adverse effect on the hit ratio.
(d)
• For the effect of block size on the cycle count and hit ratio, see the discussion
on pages 236-238 and Fig. 5.14 in the text.
• Set number and associativity: For a fixed cache size, the two parameters
are inversely proportional to each other. When the number of sets is small,
the cache behaves more like a fully associative cache. When the number of sets
is large, its behavior is close to that of direct mapping and the hit ratio
is expected to become lower. The actual performance depends on the
characteristics of the application programs.
• Cache size: With a larger cache, more data and instructions can be held in
the cache, which improves both the hit ratio and the cycle count.
Problem 5.19
(a) A memory manager performs several functions:
• It keeps track of the memory space being used by individual processes and
their IDs.
• It determines which processes are to be loaded into memory when memory space
is freed.
• It allocates and deallocates memory space as needed.
(b) Suppose a new block needs to be brought into memory. In nonpreemptive alloca-
tion, the incoming block can only be placed in a free memory block. In a preemptive
allocation scheme, the incoming block is allowed to be placed in a block currently
occupied by another process. A nonpreemptive scheme is easier to implement, but a
preemptive scheme can make better use of the memory space.
(c) In a swapping system, an entire process (instructions and data) is swapped between
main memory and disk. In other words, a process is either resident in memory or
forced out of it in its entirety. Examples are the PDP-11 and early UNIX systems.
(d) In a demand paging system, individual pages rather than entire processes can be
swapped between main memory and disks independently. A page is brought into
memory only when it is demanded. Demand paging has been implemented in
recent releases of the UNIX system.
(e) Hybrid memory systems use a combination of swapping and demand paging in
managing the memory system. Examples include VAX/VMS and UNIX System
V.
Problem 5.20
(a) Lamport’s definition of sequeritial consistency (SC) gets rid of the concept of 2
global clock and relies solely on the ordering of events. The concepts of program76
(b)
(e)
Bus, Cache, and Shared Memory
order and memory order form the foundation of various memory consistency models,
developed subsequently. The conditions given by Dubois et al. are sufficient but
not necessary conditions for implementing SC. Their definition is centered around
the abstract notion of memory operations in one processor being "performed with
respect to other processors". Sindhu et al.'s definition is more formal. A set of
axioms based on the mathematical notions of total and partial ordering is used to
rigorously specify the behavior of memory systems that satisfy SC. It also defines
an atomic swap operation for the implementation of test-and-set, which is used
to guarantee mutually exclusive entry into critical sections. The similarity among
the three SC models is the total ordering of memory events and the obedience of
program order within each processor.
(b) The DSB model of weak consistency (WC) imposes SC on synchronization opera-
tions only. Other store and load operations are allowed to proceed without waiting
for the completion of one another as required by the SC model. This allows a higher
degree of parallelism to be realized. The TSO model imposes program order only on
store-store (write after write) operations. Load operations do not have to be visi-
ble to the shared memory provided they can be satisfied by a corresponding store
operation in the write buffer. The TSO model also allows a load operation to bypass
write operations.
(c) PSO is derived from TSO by distinguishing among the store operations performed by an
individual processor. In TSO, all the store operations in a processor have to be
carried out in program order. But in PSO, only two types of store operations need
to be performed in program order: (1) store operations explicitly separated by a
store barrier (Stbar) in the program; (2) store operations performed to the same
memory location. In other words, stores which are to different memory locations
and are not separated by Stbars are allowed to be executed out of program order.
As such, the write buffer in each processor is no longer a FIFO queue. This is
similar to the DSB weak consistency model and is likely to increase parallelism. The
drawback is that the programmer has to determine where strict program order has
to be followed and insert Stbars in those places.
Chapter 6
Pipelining and Superscalar Techniques
Problem 6.1
(a)
    Speedup = nk / [k + (n - 1)] = (15000 x 5) / (5 + (15000 - 1)) = 75000/15004 = 4.9986.
(b)
    Efficiency = n / [k + (n - 1)] = 15000/15004 = 0.9997.
    Throughput = nf / [k + (n - 1)] = 15000 x 25 x 10^6 (instructions/s) / 15004 = 24.99 MIPS.
Problem 6.2
(a) The clock frequency of the DEC Alpha is 150 MHz. Comparing it with the 25 MHz of
a base machine, the superpipeline degree is 6. Alpha issues two instructions every
cycle; therefore, its superscalar degree is 2.
(b) Alpha has a huge virtual address space: virtual addresses are 64 bits long. Alpha
provides instructions for synchronization and cache coherence. This makes it suit-
able for building multiprocessor systems. However, the scalability of multiprocessor
systems is lower than that of multicomputer systems.
Problem 6.3
(a) The superpipelined structure has extra startup overhead and a higher branch penalty.
See the original paper by Jouppi and Wall (1989).
(b) Under steady state, a superpipelined machine of degree n and a superscalar ma-
chine of degree n can both execute n instructions simultaneously. The super-
pipelined machine outputs one result every 1/n clock cycle, while the superscalar
machine outputs n results every clock cycle.
Problem 6.4 The performance/cost ratio can be expressed as
    PCR = 1 / [(t/k + d)(c + kh)].
Maximizing PCR is the same as minimizing its inverse. Let k0 be the optimal number
of pipeline stages. Then
    d/dk (1/PCR) at k = k0 is 0,
whence
    -(t/k0^2)(c + k0 h) + (t/k0 + d) h = 0.
After some simplification, we get
    k0 = sqrt(tc / (dh)).
Note that
    d^2/dk^2 (1/PCR) at k = k0 is > 0.
Therefore, at k0, 1/PCR is a minimum and PCR is a maximum.
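A short numeric check of this result: the sketch below computes k0 and searches the integer stage counts for the one that maximizes PCR; the parameter values are illustrative only.

    # Optimal number of pipeline stages k0 = sqrt(t*c/(d*h)), with a numeric
    # check that it maximizes PCR = 1 / ((t/k + d) * (c + k*h)).

    import math

    def pcr(k, t, c, d, h):
        return 1.0 / ((t / k + d) * (c + k * h))

    def k_opt(t, c, d, h):
        return math.sqrt(t * c / (d * h))

    t, c, d, h = 120.0, 50.0, 5.0, 10.0      # illustrative values only
    best_int_k = max(range(1, 30), key=lambda k: pcr(k, t, c, d, h))
    print(round(k_opt(t, c, d, h), 2), best_int_k)   # ~10.95 and 11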
Problem 6.5 Lower bound of MAL = the maximum number of checkmarks in any
row of the reservation table. Upper bound of MAL = the number of 1's in the initial
collision vector plus 1. A detailed proof can be found in the paper by Shar (1972).
Problem 6.6
(a) Forbidden latencies: 1, 2, and 5. Initial collision vector: (10011).
(b) State transition diagram:
[State transition diagram omitted.]
(c) MAL = 3.
(d) Throughput = 1/(3 x 20 ns) = 16.67 million operations per second (MOPS).
(e) Lower bound of MAL = 2. The optimal latency is not achieved.
Problem 6.7
(a) Reservation table:
(b) State transition diagram:
{c) Simple cycles: (4), (5), (7), (81); (3,4), (8,5,4), (3.5.7), (1,7), (6,4); (5.7), G7),
(2,84), (2,3,5,4), (1,8,5,7}, (13,7), (1943), (Lets4), (14,7), (5:3,4), (5,3,7), (5,3,1,7)
Greedy cycle: (1,3)
(d)
    MAL = (1 + 3)/2 = 2.
(e)
    Throughput = 1/(2t), where t is the clock period.
Problem 6.8
(a) We can complete the computation in N + 11 clock cycles by the following sequence:
• cycle 1: Compute A1 + 0. Feed A1 to X and 0 to Y. Connect X and
Y to the inputs of the adder.
• cycle 2: Compute A2 + 0. Feed A2 to X and 0 to Y.
• cycle 3: Compute A3 + 0. Feed A3 to X and 0 to Y.
• cycle 4: Compute A4 + 0. Feed A4 to X and 0 to Y.
• cycle 5: Compute A1 + A5. Switch the lower switch to feed Z to the
lower input of S1 from now on, and feed A5 to the upper input.
• cycle 6: Compute A2 + A6. Feed A6 to the upper input of S1.
• cycle 7: Compute A3 + A7. Feed A7 to the upper input of S1.
• cycle 8: Compute A4 + A8. Feed A8 to the upper input of S1.
• cycle 9: Compute A1 + A5 + A9. Feed A9 to the upper input of S1.
• cycle 10: Compute A2 + A6 + A10. Feed A10 to the upper input of S1.
• cycle 11: Compute A3 + A7 + A11. Feed A11 to the upper input of S1.
• cycle 12: Compute A4 + A8 + A12. Feed A12 to the upper input of S1.
...
• cycle N-3: Compute A1 + A5 + A9 + ... + A(N-3). Feed A(N-3) to the upper
input of S1.
• cycle N-2: Compute A2 + A6 + A10 + ... + A(N-2). Feed A(N-2) to the
upper input of S1.
• cycle N-1: Compute A3 + A7 + A11 + ... + A(N-1). Feed A(N-1) to the
upper input of S1.
• cycle N: Compute A4 + A8 + A12 + ... + A(N). Feed A(N) to the upper
input of S1.
• cycle N+1: Store Z (= A1 + A5 + ... + A(N-3)) in R and switch the upper
switch to feed R to the upper input of S1 from now on.
• cycle N+2: Compute A1 + A5 + A9 + ... + A(N-3) + A2 + A6 + A10 + ... + A(N-2).
• cycle N+3: Store Z (= A3 + A7 + ... + A(N-1)) in R.
• cycle N+4: Compute A3 + A7 + A11 + ... + A(N-1) + A4 + A8 + A12 + ... + A(N).
• cycle N+5:
• cycle N+6: Store Z (= A1 + A2 + A5 + A6 + ... + A(N-3) + A(N-2)) in R.
• cycle N+7:
• cycle N+8: Compute A1 + A5 + A9 + ... + A(N-3) + A2 + A6 + A10 + ...
+ A(N-2) + A3 + A7 + A11 + ... + A(N-1) + A4 + A8 + A12 + ... + A(N).
• cycle N+9:
• cycle N+10:
• cycle N+11:
• cycle N+12: Result output from Z, which is the sum of all elements of A.
(b) The N values are fed sequentially to a nonpipelined adder; therefore, Nk cycles
are needed. The speedup is
    S4(N) = Nk / (N + 11).
For N = 64 and k = 4,
    S4(64) = (64 x 4)/(64 + 11) = 3.41.
Among the N + 12 cycles, 8 cycles (N+1, N+3, N+5, N+6, N+7, N+9, N+10,
and N+11) issue no useful add instructions. Therefore, N + 3 useful add instructions
are performed. The efficiency is
    eta4(N) = (N + 3)/(N + 11),
    eta4(64) = 67/75 = 0.89.
(c)
    eta4(infinity) = lim (N + 3)/(N + 11) = 1 as N grows without bound.
(d) The half-performance vector length N_1/2 satisfies
    S4(N_1/2) = S4(infinity)/2 = 2.
Solving 4 N_1/2 / (N_1/2 + 11) = 2 gives N_1/2 = 11.
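The speedup and efficiency expressions of parts (b) and (c) are easy to tabulate; the following Python sketch shows how both approach their limits of 4 and 1 as N grows.

    # Speedup and efficiency of the four-stage pipelined summation.

    def speedup(N, k=4):
        return N * k / (N + 11)

    def efficiency(N):
        return (N + 3) / (N + 11)

    for N in (16, 64, 256, 1024):
        print(N, round(speedup(N), 2), round(efficiency(N), 2))
    # speedup(64) = 3.41 and efficiency(64) = 0.89, as computed above.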
Problem 6.9
(a) Forbidden latency: 3; collision vector: (100).
(b) The state transition diagram is shown below:
[State transition diagram omitted.]
(c) Simple cycles: (2), (4), (1,4), (1,1,4), and (2,4); greedy cycles: (2) and (1,1,4).
(d) Optimal constant latency cycle: (2); MAL = 2.
    Throughput = 1/(2 x 20 ns) = 25 MOPS.
Problem 6.10
(a) Forbidden latencies: 3, 4, 5; collision vector: (11100).
(b) The state transition diagram is shown below:
[State transition diagram omitted.]
(c) Simple cycles: (1,1,6), (2,6), (6), and (1,6).
(d) Greedy cycle: (1,1,6).
(e) MAL = (1 + 1 + 6)/3 = 2.67.
(f) Minimum allowed constant latency cycle: (6).
(g) Maximum throughput = 1/(2.67t) = 3/(8t).
(h) 1/(6t).
Problem 6.11 The three pipeline stages are referred to as IF, OF, and EX for instruc-
tion fetch, operand fetch, and execution, respectively. The following diagram shows the
sequence of execution:
[Space-time diagram: the five instructions I1-I5 flowing through the IF, OF, and EX stages.]
At t3, O(I1) intersected with I(I2) = {R0} -> RAW hazard.
At t4, O(I2) intersected with I(I3) = {Acc} -> RAW hazard.
At t5, O(I4) intersected with I(I5) = {Acc} -> RAW hazard.
The following shows a schedule which avoids the hazard conditions:
[Diagram: the instructions I1-I5 scheduled through the IF, OF, and EX stages with delays inserted so that the hazards above are avoided.]
Problem 6.12
(a) For the given value ranges of m and n, we know that mn(N - 1) > N - 1 > N - m.
Now, Eq. 6.32 can be rewritten as
    S(m, n) = [mn(N - 1) + mnk] / [(N - 1) + mnk].
From elementary algebra, we know that the right-hand side of the above equation
attains its largest value when the term mnk is smallest. As a result, the value
of k should be 1 in order to maximize S(m, n).
(b) Instruction-level parallelism limits the growth of the superscalar degree.
(c) The multiphase clocking technique limits the growth of the superpipeline degree.
Problem 6.13
* Solution 1
(a) Reservation table:
(b) Forbidden latency: 4. Collision vector: (1000).
(c) State transition diagram:
(d) Simple cycles: (1,5), (1,1,5), (1,1,1,5), (1,2,5), (1,2,3,5), (1,2,3,2,5), (1,2,3,2,1,5),
(2,5), (2,1,5), (2,1,2,5), (2,1,2,3,5), (2,3,5), (3,5), (3,2,5), (3,2,1,5), (3,2,1,2,5),
(5), (3,2,1,2), and (3).
(e) Greedy cycles: (1,1,1,5) and (1,2,3,2).
(f) MAL = (1 + 1 + 1 + 5)/4 = 2.
(g) Maximum throughput = 1/(2t).
* Solution 2
(a) Reservation table:
(b) Forbidden latency: 2, 4. Collision vector: (1010).
(c) State transition diagram
[State transition diagram omitted.]
(d) Simple cycles: (3), (5), (1,5), and (3,5).
(e) Greedy cycles: (1,5) and (3).
(f) MAL = 3.
(g) Maximum throughput = 1/(3t).
Problem 6.14
(a) The complete reservation table for the composite pipeline is as follows:
[Reservation table omitted: an 11-column table for the composite pipeline.]
(b) Forbidden latencies: 1, 2, 3, 7, 8, 9. Collision vector: (111000111).
(c) State transition diagram:
[State transition diagram omitted.]
(d) Simple cycles: (5), (6), (10), (4,6), (4,10), (5,6), and (5,10). Greedy cycles: (5)
and (4,6).
(e) MAL = 5.
(f) Maximum throughput = 1/(5t).
Problem 6.15
(a) X needs 400 (= 4 x 100) cycles to execute the program; this takes 16000 ns (400 x
40 ns). Y needs 104 (= 5 + (100 - 1)) cycles to execute the program; this takes 5200
ns.
    Speedup = 16000/5200 = 3.08.
(b) The MIPS rates are computed as follows:
    X: 100/16000 ns = 6.25 MIPS,
    Y: 100/5200 ns = 19.2 MIPS.
Problem 6.16
(a) The five-stage multiply pipeline is depicted below:
[Diagram omitted.]
(b) The minimum clock period is tau = tau_m + d = 90 + 20 = 110 ns.
(c) The maximum throughput = 1/(110 ns) = 9.1 MOPS.
Problem 6.17
(a) 1. Exponent subtract
2. Align
3. Fraction add
4. Normalize
(b) From the solution of Problem 6.8, 111 clock cycles are needed.
Problem 6.18
(a) The composite pipeline:
[Diagram omitted.]
(b) Connection of the third adder:
[Diagram omitted.]
Chapter 7
Multiprocessors and Multicomputers
Problem 7.1 Since requests are continually generated by the processors during each
cycle, the bus never becomes idle. The memory requests are uniformly distributed
across all the modules. Thus the probability that a memory module is selected is 1/m
in each cycle. After a memory module is selected, it will be busy for c cycles. Then it
may be reselected or remain idle for a number of cycles. The behavior of a memory
module can be described by the following diagram:
[Diagram: a memory module cycles through an address-latch cycle (1 cycle), c busy cycles, and an idle/waiting period before the next selection.]
The idle or waiting period can be modeled by a random variable x which follows a
geometric distribution. The mean value of x can be computed as follows:
    E[x] = sum over i >= 0 of i (1 - p)^i p.
Let
    f(z) = sum over i >= 0 of (1 - p)^i z^i = 1 / (1 - (1 - p)z).
From the theory of z-transforms, we know that
    E[x] = p [df(z)/dz] evaluated at z = 1.
Let p = 1/m and q = 1 - p, whence
    E[x] = p f'(1) = pq/(1 - q)^2 = q/p = m - 1.
(a) The memory bandwidth delivered by the bus configuration is
    BW = mc / ((c + m) tau).
Using the given values for the variables, we obtain
    BW = (16 x 4) / ((4 + 16) tau) = 3.2/tau words per second,
i.e., 3.2 words per clock cycle.
(b) The fraction of time during which a memory module is busy is
    c / (c + 1 + E[x]) = c / (c + m).
Since there are m independent memory modules, the utilization of all the memory
modules is
    mc / (c + m) = (4 x 16) / (4 + 16) = 3.2
requests per memory cycle.
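The utilization and bandwidth expressions can be evaluated with a short Python sketch; the clock period passed to bandwidth() is an illustrative value, since only m = 16 and c = 4 are fixed by the computation above.

    # Module utilization and delivered bandwidth for the bus system in parts
    # (a) and (b): m modules, c busy cycles per access, mean idle time m - 1.

    def busy_fraction(c, m):
        return c / (c + m)              # busy cycles over (latch + busy + idle)

    def utilization(c, m):
        return m * busy_fraction(c, m)  # expected busy modules per cycle

    def bandwidth(c, m, tau):
        return m * c / ((c + m) * tau)  # words per second for clock period tau

    print(utilization(4, 16))           # 3.2 requests per memory cycle
    print(bandwidth(4, 16, 100e-9))     # e.g. 3.2e7 words/s if tau = 100 ns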
Problem 7.2
(a) The following diagram shows the crossbar network which connects the n processors to
the n memory modules:
[Diagram omitted.]
The complexity of the crossbar network can be estimated as follows. At each
crosspoint, there are 2 AND gates and 2 OR gates. But in the last row (processor n)
and last column (memory module n), we do not need OR gates for the read/write
operation. Therefore, there are 2n^2 AND gates and 2n^2 - 2n OR gates. In practice,
each AND or OR gate in the diagram consists of w two-input AND or OR gates,
as shown in the following diagram.
In total, the number of two-input AND gates is 2n^2 w, and the number of
two-input OR gates is (2n^2 - 2n)w.
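The gate counts scale quadratically with n and linearly with the path width w; the small Python helper below (ours) makes the totals easy to evaluate for any configuration.

    # Two-input gate counts for an n x n crossbar with w-bit data paths,
    # following the expressions derived above.

    def crossbar_gate_counts(n, w):
        and_gates = 2 * n * n * w
        or_gates = (2 * n * n - 2 * n) * w
        return and_gates, or_gates

    print(crossbar_gate_counts(16, 32))   # e.g. 16 x 16 crossbar, 32-bit paths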
(b) The schematic diagram of the arbiter is shown below:
[Diagram omitted.]
In case of conflicting requests to access the same memory module, the arbiter
will grant priority to the processor with the smallest number. There are (n - 1)
two-input AND gates along each column, leading to a total requirement of n(n - 1)
such gates.
Problem 7.3
(a) The mappings are shown in the following figure:
[Diagram: main memory blocks mapped to the cache under (i) direct mapping and (ii) two-way set-associative mapping.]
(b) The results are shown in the following two tables; the first table corresponds to
direct mapping and the second to two-way set-associative mapping. In each table, an
arrow connecting the same block numbers indicates that the corresponding access
takes more than one cycle due to read/write misses or bus contention. In any case,
at most 3 cycles are required to complete an access in the case of a read/write miss
coupled with bus contention. The subscript associated with a block indicates the
state of that block (R for read-only, W for read-write).
[Tables: cycle-by-cycle contents of the four cache block frames of processors P1 and P2 under direct mapping and under two-way set-associative mapping, together with the cache-miss and bus-in-use indications for each reference.]
For the given block reference patterns, the hit ratio is 6/11 for P1 and 7/11 for P2 with either cache organization. The major difference is the contents of the block frames in the caches, due to the different ways of mapping between memory and cache. As can be seen, a memory block can reside in more cache block frames with the set-associative organization, which generally improves the hit ratio.
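Hit-ratio counts of this kind can be reproduced with a small simulator. The sketch below is an illustration only; the block trace, the four frames, and the LRU policy for the set-associative case are hypothetical stand-ins, not the exact setting of Problem 7.3.

def direct_mapped_hits(trace, num_frames=4):
    """Each block maps to frame (block mod num_frames)."""
    frames = [None] * num_frames
    hits = 0
    for blk in trace:
        idx = blk % num_frames
        if frames[idx] == blk:
            hits += 1
        else:
            frames[idx] = blk            # miss: load the block into its frame
    return hits

def two_way_assoc_hits(trace, num_frames=4):
    """Two frames per set; set index = block mod (num_frames // 2); LRU within a set."""
    num_sets = num_frames // 2
    sets = [[] for _ in range(num_sets)]  # each list holds up to 2 blocks, MRU last
    hits = 0
    for blk in trace:
        s = sets[blk % num_sets]
        if blk in s:
            hits += 1
            s.remove(blk)
        elif len(s) == 2:
            s.pop(0)                      # evict the least recently used block
        s.append(blk)                     # (re)insert as most recently used
    return hits

if __name__ == "__main__":
    trace = [0, 0, 1, 2, 0, 4, 0, 2, 6, 2, 0]   # hypothetical block reference trace
    print("direct-mapped hits  :", direct_mapped_hits(trace), "of", len(trace))
    print("2-way set-assoc hits:", two_way_assoc_hits(trace), "of", len(trace))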
Problem 7.4 A valid schedule must satisfy the following two conditions:

(1) It does not violate the dependence relations specified in the diagram.

(2) It does not cause resource conflicts. In other words, no processor or memory module can be allocated to more than one segment at a time.

There are many possible ways to schedule the program segments without violating the above conditions. A systematic approach is to use list scheduling as discussed in [Adam74]. The heuristic is to identify a critical path based on the memory latency and to schedule segments on the critical path first, under the data dependence and resource constraints; a small illustrative sketch of this heuristic follows.
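In the sketch below, the tasks, durations, dependences, and number of processors are hypothetical examples, and the priority of a task is simply the length of its longest path to a sink, in the spirit of the critical-path heuristic; it is an illustration only.

def critical_path_lengths(duration, succs):
    """Longest path (in total duration) from each task to any sink."""
    cp = {}
    def length(t):
        if t not in cp:
            cp[t] = duration[t] + max((length(s) for s in succs.get(t, [])), default=0)
        return cp[t]
    for t in duration:
        length(t)
    return cp

def list_schedule(duration, preds, num_procs):
    """Greedy list scheduling: at each step start the ready task with the
    longest critical path on the first free processor."""
    succs = {}
    for t, ps in preds.items():
        for p in ps:
            succs.setdefault(p, []).append(t)
    prio = critical_path_lengths(duration, succs)
    finish, schedule = {}, {}
    proc_free = [0] * num_procs            # next free time of each processor
    unscheduled = set(duration)
    while unscheduled:
        ready = [t for t in unscheduled if all(p in finish for p in preds.get(t, []))]
        t = max(ready, key=lambda x: prio[x])
        p = min(range(num_procs), key=lambda i: proc_free[i])
        start = max([proc_free[p]] + [finish[q] for q in preds.get(t, [])])
        finish[t] = start + duration[t]
        proc_free[p] = finish[t]
        schedule[t] = (p, start)
        unscheduled.remove(t)
    return schedule, max(finish.values())

if __name__ == "__main__":
    duration = {"a": 3, "b": 2, "c": 2, "d": 4, "e": 1}
    preds = {"c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}
    sched, makespan = list_schedule(duration, preds, num_procs=2)
    print(sched, "makespan =", makespan)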
One processor is demanded the most among all the processors and is busy for 20 time steps. Moreover, none of the memory modules is requested more than 20 times, as shown in the following table:

Memory module | Access frequency
M1            |
M2            |
M3            | 15
M4            | 12
M5            | 8
M6            |
Total         | 70

Therefore, the demand on this processor precludes the possibility of a schedule that can finish the task in less than 20 time steps. In the following table, we show one possible schedule. Each cell of the table contains a pair of numbers x,y, with x corresponding to the instruction and y to the memory module requested. According to condition (2) above, the value of y should be different in cells on the same row.
[Table: one feasible schedule. Rows correspond to time steps 1-21 and columns to processors; each cell holds a pair x,y giving the instruction x and the memory module y it accesses.]
Based on this schedule, 21 time steps are required and the average memory bandwidth is 70/21 = 3.33 words per time step. There are other schedules that yield an identical bandwidth. For instance, the pair 1,4 on row 3 can be moved to the same column on row 1.

Note that in the above schedule an additional condition is satisfied; that is, once a segment is scheduled, it runs in consecutive time steps until completion without being interrupted. If this condition is relaxed, it is possible to obtain a schedule with a total of 20 time steps.
Problem 7.5

(a) Three switch cells are needed to combine the inputs, as shown in the following diagram:

[Figure: three switch cells arranged to combine the incoming Fetch&Add requests.]

Each switch box is able to perform the following functions (see [Stone90], p. 348); a small illustration follows the list:

• Match the addresses on the upper and lower inputs.
• Add the two increments.
• Save one of the increments.
• Match a returning value for Fetch&Add to the saved increment.
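The following is only an illustration of these functions, under simplifying assumptions: one switch cell merges two Fetch&Add requests to the same location by forwarding the sum of the increments, saving one increment, and splitting the returned value on the way back.

def combine_fetch_add(inc_upper, inc_lower):
    """Combine two Fetch&Add requests to the same address arriving on the
    upper and lower inputs; return (forwarded increment, saved increment)."""
    return inc_upper + inc_lower, inc_upper     # save the upper increment

def split_reply(returned_value, saved_increment):
    """On the way back, the memory's old value serves the upper request and
    old value + saved increment serves the lower request."""
    return returned_value, returned_value + saved_increment

if __name__ == "__main__":
    memory = {100: 7}                           # word at a hypothetical address 100
    fwd_inc, saved = combine_fetch_add(inc_upper=2, inc_lower=5)
    old = memory[100]                           # memory performs a single Fetch&Add
    memory[100] = old + fwd_inc
    upper_val, lower_val = split_reply(old, saved)
    print("returned to upper requester:", upper_val)   # 7
    print("returned to lower requester:", lower_val)   # 9
    print("final memory value         :", memory[100]) # 14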
(b) The following figure shows a possible scenario of the data transfer between the processors and a certain memory module (hot spot).

[Figure: a possible scenario of the combined Fetch&Add data transfer between the processors and the hot-spot memory module.]

The final content of the memory location is the same regardless of the serialization of the increments e_i. But the increment saved in the buffer of each switch cell can be different, resulting in different values being returned to different processors.
Problem 7.6 The m-way shuffle of n objects, where n = mk, is defined by the following permutation: an object in position qk + b (with 0 <= q < m and 0 <= b < k) is moved to position bm + q.

... k0, the demand for synchronization lines will exceed m. Therefore, the degree of multiprogramming should not exceed k0.
Problem 7.9 An important property of the multilevel bus/cache architecture is that any memory block which has a copy in the level-1 caches also has a copy in the level-2 cache. This inclusion property makes it possible to use the level-2 caches as filters to avoid unnecessary traffic on the buses.

Consider the use of a write-broadcast protocol to maintain cache coherence of the system in Fig. 7.3. When a level-1 cache writes to a memory block, the updated value is broadcast on the intracluster bus so that the other caches which have a copy of the memory block will update their data.

The updated value is also propagated up to C20, which updates its copy of the memory block. C20 then broadcasts the new value on the intercluster bus. If a copy of the block exists in another level-2 cache (C21, for instance), its value is updated. By the inclusion property, the memory block is likely to be resident in the cluster underneath C21. Therefore, the data is also passed down to the intracluster bus, and the level-1 caches which have a copy of the memory block also update their values.

The relative merits of write-invalidate and write-broadcast protocols have been studied extensively, through either simulation or analytic approaches [Archibald86, Yang89]. See the discussion in the solution for Problem 7.19 below. Most comparisons have been conducted on single-level caches, but the results should be applicable to hierarchical caches. Write-broadcast protocols generally exhibit better performance, although actual performance depends on the memory reference patterns. An advantage of write-invalidate protocols is the relatively simple hardware support required.
Problem 7.10

(a) The general trend in the industry is toward open systems, which favor commercially available processors over proprietary designs. This helps reduce the cost and shorten the time of development. More effort can be focused on high-level design such as the interconnection structure and software development.

(b) Increasing scalability is the main motivation.

(c) To avoid the problems of memory contention and/or cache inconsistency (if private memory or cache is used).

(d) To offer more flexibility in using existing multiprocessor software.
Problem 7.11

(a) A message is the logical unit for internode communication. It is often assembled from an arbitrary number of fixed-length packets and may have a variable length.

A packet is the basic unit of information transmission; it contains the destination address for routing purposes.

A flit (flow control digit) is the smallest unit of information that a queue or channel can accept or refuse.

(b) In a store-and-forward network, the basic unit of information flow is a packet. Each node has a packet buffer. A packet is transmitted from a source node to a destination node through a sequence of intermediate nodes. When a packet reaches an intermediate node, it is first stored in the buffer. Then it is forwarded to the next node if the desired output channel and a packet buffer in the receiving node are both available.

(c) In the wormhole routing scheme, a flit is the basic unit of information flow. Flit buffers are used in the hardware routers attached to the nodes. The transmission from the source node to the destination node is done through a sequence of routers. All the flits in the same packet are transmitted in order, as inseparable companions, in a pipelined fashion. Only the header flit knows where the packet is going; all the data flits must follow the header flit. Different packets can be interleaved during transmission. However, the flits from different packets cannot be mixed up.

(d) A virtual channel is a logical link between two nodes. It is formed with a flit buffer in the source node, a physical channel between them, and a flit buffer in the receiver node. There can be more than one virtual channel between two nodes; however, a smaller number of physical channels is time-shared by all the virtual channels.

(e) Buffer deadlocks may occur with store-and-forward routing, in which no buffers are provided on the channels. A deadlock situation occurs when there is a circular wait among the nodes and the buffers in the nodes are all full. Channel deadlock can occur with wormhole routing when the channels used by different messages enter a circular wait. Both types of deadlock are illustrated in Example 7.2.

(f) When two packets reach the same node and request the same outgoing channel, the cut-through routing scheme uses a packet buffer to temporarily store one of the received packets. When the channel becomes available later, the stored packet is then transmitted.

(g) When two packets reach the same node and request the same outgoing channel, the blocking policy blocks the second packet from advancing. However, the packet is not abandoned.

(h) When two packets reach the same node and request the same outgoing channel, the discard policy simply drops the packet being blocked. Packet retransmission is required when the channel becomes available later.

(i) In detour flow control, the blocked packet is rerouted to a detour channel. From there, another route may be found to reach the destination node.

(j) • A virtual network is a network in which all nodes are connected by virtual channels. Since there are multiple virtual channels between two nodes, several virtual networks can be formed.
• The nodes in a network can be subdivided into several subsets. The nodes in a subset and their connections form a subnetwork of the original network.
Problem 7.12

(a) A 16 x 16 Omega network using 2 x 2 switches is shown below:

[Figure: 16 x 16 Omega network built from four stages of 2 x 2 switches, with the connections 1011 -> 0101 and 0111 -> 1001 highlighted.]
(b) The connections 1011 -> 0101 and 0111 -> 1001 are indicated in the above diagram. As can be seen, there is no blocking between the two connections.

(c) Each switch box can implement two permutations in one pass (straight or crossed). There are (16/2) x log2 16 = 32 switch boxes. Therefore, the total number of single-pass permutations is

$$2^{(16/2)\log_2 16} = 2^{32}.$$

The total number of permutations of 16 inputs is 16!, therefore

$$\frac{\text{number of single-pass permutations}}{\text{total number of permutations}} = \frac{2^{32}}{16!} \approx 2.05 \times 10^{-4}.$$

(d) At most log2 16 = 4 passes are needed to realize all permutations.
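Single-pass routing in an Omega network is self-routing on the destination tag, which also makes conflict checks easy to automate. The sketch below is an illustration only, assuming the usual shuffle-then-switch construction with the switch at stage k set by bit (3 - k) of the destination address; it can be used to confirm that the two connections of part (b) do not block each other.

def omega_path(src, dst, bits=4):
    """Destination-tag routing in a 2x2-switch Omega network with 2**bits ports.
    Returns the (stage, switch, output-port) triples used by the connection."""
    pos, path = src, []
    for stage in range(bits):
        pos = ((pos << 1) | (pos >> (bits - 1))) & ((1 << bits) - 1)  # perfect shuffle
        out = (dst >> (bits - 1 - stage)) & 1        # switch set by this destination bit
        switch = pos >> 1
        pos = (switch << 1) | out                    # leave on the selected output
        path.append((stage, switch, out))
    return path

def blocks(conn_a, conn_b):
    """Two connections block each other if they need the same switch output link."""
    return any(x == y for x, y in zip(omega_path(*conn_a), omega_path(*conn_b)))

if __name__ == "__main__":
    a = (0b1011, 0b0101)     # the two connections of part (b)
    b = (0b0111, 0b1001)
    print("conflict:", blocks(a, b))   # expected: False (no blocking)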
Problem 7.13

(a) A unicast pattern is a one-to-one communication, and a multicast pattern is a one-to-many communication.

(b) A broadcast pattern is a one-to-all communication, and a conference pattern is a many-to-many communication.

(c) The channel traffic at any time instant is indicated by the number of channels used to deliver the message involved.

(d) The communication latency is indicated by the longest packet transmission time involved.

(e) Partitioning of a physical network into several logical subnetworks. In each of the subnetworks, appropriate routing schemes can be used to avoid deadlock.
Problem 7.14

(a) (101101) -> (101100) -> (101110) -> (101010) -> (111010) -> (011010).

(b) Two optimal routing schemes under different constraints:

• Routing with a minimum number of channels:

[Figure: multicast tree using a minimum number of channels.]

For this routing, traffic = 20, distance = 9.

• Routing with a minimum distance from the source to each destination:

[Figure: multicast tree with minimum distance from the source to each destination.]

For this routing, traffic = 22, distance = 8.

There are other routes with the same traffic and distance.

(c) The routing is shown in the following tree:

[Figure: routing tree for part (c).]

The paths are shown by heavy lines in the following diagram:

[Figure: the corresponding paths marked by heavy lines in the network.]
Problem 7.15

(a) In a hypercube of dimension n, we denote a node as n_k, where k is an n-digit binary number. Node n_k has n output channels, one for each dimension, labeled c_{0k}, c_{1k}, ..., c_{(n-1)k}. The E-cube algorithm routes in increasing order of dimension. A message arriving at node n_k destined for node n_l is routed on channel c_{ik}, where i is the position of the least significant bit in which k and l differ. Since messages are routed in order of increasing dimension, and hence increasing channel subscripts, there are no cycles in the channel dependency graph and E-cube routing is deadlock-free.
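The channel ordering can be made concrete with a small routine. The sketch below is an illustration only: it performs E-cube routing on an n-cube by flipping the differing address bits from the least significant dimension upward, which is the ascending-channel order used in the argument above.

def ecube_route(src, dst, n):
    """E-cube routing on an n-dimensional hypercube.
    Returns the list of (dimension, next_node) hops from src to dst."""
    hops, node = [], src
    for dim in range(n):                 # dimensions in increasing order
        if (node ^ dst) & (1 << dim):    # addresses differ in this bit
            node ^= (1 << dim)           # traverse the channel in this dimension
            hops.append((dim, node))
    return hops

if __name__ == "__main__":
    # route 101101 -> 011010 on a 6-cube (the path listed in Problem 7.14a)
    for dim, node in ecube_route(0b101101, 0b011010, n=6):
        print(f"dimension {dim} -> node {node:06b}")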
(b) There are four possible X-Y routing patterns, corresponding to the east-north, east-south, west-north, and west-south paths chosen. As in the 3 x 3 mesh shown in Figure 7.37, we can have two pairs of virtual channels in the Y-dimension. For each of the four routing patterns, no cycle will be formed in the channel dependency graph. Thus, the X-Y routing is deadlock-free.
(c) In a k-ary n-cube, we denote the address of a node by n_j, where j is an n-digit radix-k number. The ith digit of j represents the node's position in dimension i. For example, the center node in the 3-ary 2-cube below is n_11. A channel is identified by the address of its source node and the dimension it is in. For example, the dimension-0 (horizontal) channel from n_11 to n_10 is c_{011}.

To break cycles we divide each channel into an upper and a lower virtual channel. The upper virtual channel of c_{011} is labeled c_{0111}, and the lower virtual channel is labeled c_{0011}. In general, virtual channel subscripts are of the form dvx, where d is the dimension, v selects the virtual channel, and x identifies the source node of the channel. To assure that a routing is deadlock-free, we restrict it to routing through channels in order of ascending subscripts.

As in the E-cube algorithm, we route messages in increasing order of the dimensions, starting with the lowest dimension. In each dimension i, a message is routed in that dimension until it reaches a node whose subscript matches the destination address in the ith position. The message is routed on the upper channel if the ith digit of the destination address is greater than the ith digit of the current node's address; otherwise, the message is routed on the lower channel. This algorithm routes messages in order of ascending subscripts. Thus, it is deadlock-free.
[Figure: a 3-ary 2-cube with its nodes and channels labeled.]
Problem 7.16

(a) The turn model works by prohibiting a minimum number of turns (changes of direction by 90 degrees) to prevent the formation of cycles. With the cycles broken, circular waits are removed and deadlocks are prevented. Formally, routing algorithms developed under this model allow a channel-numbering scheme in which the channels traveled by each packet either increase or decrease monotonically. This type of routing has been shown to be deadlock-free. See also the solution for Problem 7.15.

(b) The authors describe three different routing algorithms for use with n-dimensional meshes: all-but-one-negative-first, all-but-one-positive-last, and negative-first. These algorithms specify that a packet should use outgoing channels along certain directions before (or after) the others. As stated in (a), the algorithms have to be used in conjunction with special channel-numbering schemes. For more details, see the paper.

(c) A k-ary n-cube uses a torus connection along each dimension; i.e., each node at the edge of a mesh has a wraparound connection. One way to use the algorithms developed for meshes is to assign to each wraparound channel a number greater (or smaller) than that of any other channel along that direction in the mesh, depending on the routing algorithm used.
Problem 7.17

(a) In multicast, the objectives are twofold. One is to send a message to all the destination nodes, and the other is to do so efficiently. A tree can be constructed and used to determine the minimum subtree which covers all the destination nodes. This is illustrated in the following diagram using the multicast pattern in Example 7.8.

[Figure: tree covering the multicast destinations, with destinations enclosed in boxes and the chosen paths drawn in heavy lines.]

The destinations are enclosed in boxes. To cover all the destinations from the source with a minimum number of edges (lowest traffic), the paths indicated by heavy lines are chosen, which are identical to the choice of the greedy algorithm in Example 7.8. The path has a latency of 4 and a traffic of 10. Note that there are alternatives to some of the nodes/edges selected. For instance, 1001 can be used instead of 1111, and destination 1010 can be reached from 1011 instead of 1110.

(b) The greedy multicast algorithm provides a strategy to deterministically select intermediate nodes (called forward nodes in the paper) between the source and the destinations. The selection is based on the distance between the addresses of the source s and a destination d_j, which is the number of 1s in r_j = s XOR d_j, where XOR stands for the bitwise exclusive-OR operation.

The design of the algorithm is such that each intermediate node on the path from s to d_j reduces the number of 1s in r_j by 1. In fact, the descendant nodes of each intermediate node are chosen according to the number of destination nodes for which this goal is achieved. Therefore, if initially s and d_j differ in b bit positions, the message will arrive at d_j in b steps, which is the minimum possible number of steps on a hypercube.

The authors proved that the greedy algorithm also minimizes network traffic if the number of destinations is 1 or 2, but is slightly inferior to the optimal algorithm when the number is larger than 2.
Problem 7.18 In the write-once protocol, a block may exist in one of four states in a cache:

• Invalid: there is no copy of the block in the cache.
• Valid: an arbitrary number of caches can have this read-only block, and all the copies are identical.
• Reserved: data in the block has been locally modified exactly once since it was brought into the cache, and shared memory has been updated.
• Dirty: data in the block has been locally modified more than once since it was brought into the cache, and the copy in shared memory is stale.

The write-once protocol is mainly characterized by the introduction of the Reserved state. A first-time write to a clean and potentially shared block results in a write-through to memory; it updates the main-memory copy as well as the local copy. The local copy becomes Reserved, which indicates an exclusive copy in the system and saves subsequent write invalidations.

Each cache has a bus watcher which monitors the transactions on the bus. When the bus watcher detects an address on the bus which hits in the local cache with a dirty copy, it intervenes in the bus transaction by asserting the memory-bypass signal to inhibit the memory from supplying the data. To allow rapid access to the address tags and state bits concurrently with accesses to the address tags by the CPU, dual (identical) cache directories are used.
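The local write transitions described above can be summarized in a small table. The sketch below is a simplified illustration of the write-once states for processor write hits only (bus-induced transitions and misses are omitted), not a complete protocol specification.

# Write-once cache block states
INVALID, VALID, RESERVED, DIRTY = "Invalid", "Valid", "Reserved", "Dirty"

def next_state_on_write_hit(state):
    """State change on a processor write hit, per the description above:
       Valid    -> Reserved  (first write: written through to memory)
       Reserved -> Dirty     (second write: local only, memory becomes stale)
       Dirty    -> Dirty     (further writes stay local)."""
    if state == VALID:
        return RESERVED
    if state in (RESERVED, DIRTY):
        return DIRTY
    raise ValueError("a write to an Invalid block is a miss, not modeled here")

if __name__ == "__main__":
    s = VALID
    for i in range(3):
        s = next_state_on_write_hit(s)
        print(f"after write {i + 1}: {s}")   # Reserved, Dirty, Dirty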
Problem 7.19

There are five states for cached blocks in the Dragon protocol: Invalid, Valid-Exclusive (only cached copy in the system; clean and identical with the memory copy), Shared-Clean, Shared-Dirty (write-back required at replacement), and Dirty (only copy in the caches, and modified).

The Dragon protocol is a write-broadcast protocol, like the Firefly protocol. As long as there exists more than one cached copy, writes are broadcast to the other caches. One difference is that updates to shared blocks are also immediately reflected in main memory in the Firefly protocol, while the Dragon protocol introduces the Shared-Dirty state so that the memory copy is updated only when the Shared-Dirty copy is replaced. The cache that performed the latest write to the shared block is in the Shared-Dirty state and is responsible for supplying the block on misses in remote caches and for updating main memory on replacement.

In the case of write hits on unmodified private blocks, the Dragon and the Firefly are able to eliminate unnecessary overhead by changing the cache state from Valid-Exclusive to Dirty without inducing any bus transaction. In contrast, the write-once protocol requires a single word to be written to main memory.

The distributed write protocols of Dragon and Firefly yield better performance than the write-invalidation of the write-once protocol in the handling of shared data. This is because the overhead of distributing written data to all caches having a copy is lower than repeatedly invalidating all other copies and subsequently forcing misses on the next references in those caches where the block was invalidated.

The performance of the Dragon can slightly exceed that of the Firefly protocol because the Firefly broadcasts writes to main memory as well as to the other caches. Therefore, the performance of the Firefly may be affected by the long latency of the memory system. But the Dragon gains this performance at the cost of adding one more state, Shared-Dirty, and it becomes more complex compared to the simplicity of the write-once protocol.
Problem 7.20

(a) When more than one input of a crossbar module wants to use the same output port, the output connection is granted to the input port with the smallest number. In the Cedar implementation, there is priority-resolution logic in each output port. An arriving packet waits in the input queue if the output port is already busy or the input is not chosen by the resolution logic. Only when all currently conflicting requests have been resolved will any new request be allowed to enter the arbitration. In this fashion, high-priority input ports are prevented from starving low-priority ones. In summary, a combination of the first-come, first-served queueing principle and a fixed priority based on the input port number is used to resolve conflicts. See [Konicek91].

(b) See Fig. 7.10a in the text for a similar connection of a 64 x 64 network using 8 x 8 switch modules.

(c) See Fig. 7.10b in the text for a similar connection of a 512 x 512 network using 8 x 8 switch modules.

Chapter 8
Multivector and SIMD Computers
Problem 8.1

(a) In the register-to-register architecture, operands and results are retrieved indirectly from the main memory through the use of a large number of vector or scalar registers. In the memory-to-memory architecture, source operands, intermediate results, and final results are retrieved directly from the main memory. More registers are needed in a register-to-register architecture, and higher memory bandwidth is needed in the memory-to-memory architecture.

(b) An SIMD machine with n processors and a pipelined machine with n stages and a 1/n clock period have the same performance (n results every basic cycle). However, the SIMD machine needs n times the hardware (ALUs), and the pipelined machine needs n times the memory bandwidth.
Problem 8.2

(a) The percentage of vector code in a program required to achieve equal utilization of the vector and scalar hardware.

(b) The percentage of code in a program which can be vectorized.

(c) A compiler capable of vectorization.

(d) The instructions correspond to mappings of the following forms:

f1 : V_i -> V_j
f2 : V_i -> s_j
f3 : s x V_i -> V_j
(e) A gather instruction fetches the nonzero elements of a sparse vector using indices; it corresponds to the mapping

f : M -> V1 x V0.

A scatter instruction stores a vector into a sparse vector whose nonzero entries are indexed; it corresponds to the mapping

f : V1 x V0 -> M.

(f) A sparse matrix is a matrix in which most of the entries are zero. A masking instruction uses a mask vector to compress or expand a vector to a shorter or longer index vector, respectively, corresponding to the mapping

f : V x V_m -> V.
Problem 8.3

(a) The low-order interleaved memory can be rearranged to allow simultaneous access, or S-access, as illustrated in Fig. 8.1a. In this case, all memory modules are accessed simultaneously in a synchronized manner. Again, the high-order (n - a) bits select the same offset word from each module.

At the end of each memory cycle (Fig. 8.1a), m = 2^a consecutive words are latched in the data buffers simultaneously. The low-order a bits are then used to multiplex the m words out, one per minor cycle. If the minor cycle (tau) is chosen to be 1/m of the major memory cycle (theta), then it takes two memory cycles to access m consecutive words.

However, if the access phase of the last access is overlapped with the fetch phase of the current access (Fig. 8.1b), effectively m words take only one memory cycle to access. If the stride is greater than 1, the throughput decreases roughly in proportion to the stride.

(b) The m-way low-order interleaved memory structure shown in Figs. 8.2a and 8.3a allows m memory words to be accessed concurrently in an overlapped manner. This concurrent access has been called C-access, as illustrated in Fig. 8.3b.

The access cycles in different memory modules are staggered. The low-order a bits select the modules, and the high-order b bits select the word within each module, where m = 2^a and a + b = n is the address length.

To access a vector with a stride of 1, successive addresses are latched in the address buffer at the rate of one per cycle. Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle (theta), as shown in Fig. 8.3b. If the stride is 2, successive accesses must be separated by two minor cycles in order to avoid access conflicts. This reduces the memory throughput by one-half. If the stride is 3, there is no module conflict and the maximum throughput (m words) results. In general, C-access will yield the maximum throughput of m words per memory cycle if the stride is relatively prime to m, the number of interleaved memory modules.
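The stride dependence of C-access throughput can be illustrated with a short calculation. The sketch below is an illustration only: the number of distinct modules touched by a stride-s access stream is m/gcd(s, m), so the throughput relative to the stride-1 case is 1/gcd(s, m), and it is maximal exactly when the stride is relatively prime to m.

from math import gcd

def distinct_modules(stride, m):
    """Number of distinct memory modules touched by addresses 0, s, 2s, ... (mod m)."""
    return m // gcd(stride, m)

def relative_throughput(stride, m):
    """C-access throughput relative to the stride-1 case (m words per memory cycle)."""
    return 1.0 / gcd(stride, m)

if __name__ == "__main__":
    m = 8                                    # eight-way interleaving, as in Fig. 8.3
    for s in (1, 2, 3, 4, 8):
        print(f"stride {s}: {distinct_modules(s, m)} modules, "
              f"throughput x{relative_throughput(s, m):.3f}")
    # stride 2 halves the throughput; stride 3 (relatively prime to 8) keeps it maximal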
(c) A memory organization in which C-access and S-access are combined is called C/S-access. This scheme is shown in Fig. 8.4, where n access buses are used with m interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to allow C-access. The n buses operate in parallel to allow S-access.
Figure 8.1 The S-access interleaved memory for vector operand access: (a) S-access organization for an m-way interleaved memory; (b) successive vector accesses using overlapped fetch and access cycles.

Figure 8.2 Two interleaved memory organizations with m = 2^a modules and w = 2^b words per module (word addresses shown in boxes): (a) low-order m-way interleaving; (b) high-order m-way interleaving.

Figure 8.3 Multiway interleaved memory organization and the C-access timing chart: (a) eight-way low-order interleaving (absolute address shown in each memory word); (b) pipelined access of eight consecutive words in a C-access memory (theta = major cycle, tau = theta/m = minor cycle, m = degree of interleaving).
In each memory cycle, at most m x n words are fetched if the n buses are fully used with pipelined memory accesses.

Figure 8.4 The C/S memory organization. (Courtesy of D. K. Panda, 1990)

The C/S-access memory is suitable for use in vector multiprocessor configurations. It provides parallel, pipelined access of vector data sets with high bandwidth. A special vector cache design is needed within each processor in order to guarantee smooth data movement between the memory and the multiple vector processors.
Problem 8.4 The comparison is summarized in the following table:

Class                                              | Architecture                                                  | Performance  | Cost
Full-scale supercomputers                          | multiprocessor, multiple vector pipelines, pipeline chaining  | > 1 Gflops   | $2 - 25 million
High-end mainframes or near-supercomputers         | attached vector processor                                     | > 200 Mflops | $1 - 7 million
Minisupercomputers or supercomputing workstations  | multicomputer                                                 | > 100 Mflops | $0.1 - 1.5 million
Problem 8.5

(a) A composite function of vector operations converted from a looping structure of linked scalar operations.

(b) • The program construct for processing long vectors is called a vector loop. When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. One segment is processed at a time.
• Pipeline chaining links vector operations following a linear dataflow pattern. Vector registers are used as interfaces between functional pipelines. Continuous data flow is maintained in successive pipelines.

(c) A synchronous program graph in which all nodes have zero delay.

(d) A pipenet is constructed by interconnecting multiple functional pipelines through two buffered crossbar networks which are themselves pipelined.
Problem 8.6

(a) Figure 8.5 shows the CM-2 processor chips with memory and floating-point chips. Each data processing node contains 32 bit-slice data processors, an optional floating-point accelerator, and interfaces for interprocessor communication. Each data processor is implemented with a 3-input, 2-output bit-slice ALU and associated latches and memory interface. This ALU can perform bit-serial full-adder and Boolean logic operations.

The processor chips are paired in each node, sharing a group of memory chips. Each processor chip contains 16 processors. The parallel instruction set, called Paris, includes nanoinstructions for memory load and store, arithmetic and logic, control of the router, NEWS grid, and hypercube interface, floating-point, I/O, and diagnostic operations.

The memory data path is 22 bits (16 data and 6 ECC) per processor chip. The 18-bit memory address allows 2^18 = 256K memory words (512 Kbytes of data) shared by 32 processors. The floating-point chip handles 32-bit operations at a time. Intermediate computational results can be stored back into the memory for subsequent use. Note that integer arithmetic is carried out directly by the processors in a bit-serial fashion.

(b) • Special hardware is built on each processor chip for data routing among the processors. The router nodes on all processor chips are wired together to form a Boolean n-cube. A full configuration of the CM-2 has 4096 router nodes on processor chips interconnected as a 12-dimensional hypercube.

Each router node is connected to 12 other router nodes, including its paired node (Fig. 8.5). All 16 processors belonging to the same node are equally capable of sending a message from one vertex to any other processor at another vertex of the 12-cube. The following example clarifies this message-passing concept.

On each vertex of the 12-cube, the processors are numbered 0 through 15. The hypercube routers are numbered 0 through 4095 at the 4096 vertices. Processor 5 on router node 7 is thus identified as the 117th processor in the entire system, because 16 x 7 + 5 = 117.
Figure 8.5 A CM-2 processing node consisting of two processor chips and some memory and floating-point chips. (Courtesy of Thinking Machines Corporation, 1990)
Suppose processor 117 wants to send a message to processor 361, which is located at processor 9 on router node 22 (16 x 22 + 9 = 361). Since router node 7 = (000000000111)_2 and router node 22 = (000000010110)_2, they differ in dimension 0 and dimension 4.

This message must traverse dimensions 0 and 4 to reach its destination. From router node 7, the message is first directed to router node 6 = (000000000110)_2 through dimension 0 and then to router node 22 through dimension 4, if there is no contention for hypercube wires. On the other hand, if router 7 has another message using the dimension-0 wire, the message can be routed first through dimension 4 to router 23 = (000000010111)_2 and then to the final destination through dimension 0 to avoid channel conflict.

• Within each processor chip, the 16 physical processors can be arranged as an 8 x 2, 1 x 16, 4 x 4, 4 x 2 x 2, or 2 x 2 x 2 x 2 grid, and so on. Sixty-four virtual processors can be assigned to each physical processor. These 64 virtual processors can be envisioned as forming an 8 x 8 grid within the chip.

The NEWS grid name reflects the fact that each processor has a north, east, west, and south neighbor in the various grid configurations. Furthermore, a subset of the hypercube wires can be chosen to connect the 2^12 nodes (chips) as a two-dimensional grid of any shape. For instance, 64 x 64 is one of the possible grid configurations.

Coupling the internal grid configuration within each node with the global grid configuration, one can arrange the processors in NEWS grids of any shape involving any number of dimensions. This flexible interconnection among the processors makes the machine very attractive for routing data on dedicated grid configurations based on the application requirements.

(c) Besides dynamic reconfiguration in NEWS grids through the hypercube routers, the CM-2 has special built-in hardware support for scanning or spreading across the NEWS grids. These are very powerful parallel operations for fast data combining or spreading throughout the entire array.

Scanning on NEWS grids combines communication and computation. The operation can simultaneously scan every row of a grid along a particular dimension for the partial sums of that row, for the largest or smallest value, or for the bitwise OR, AND, or exclusive OR. Scanning operations can be expanded to cover all elements of an array.

Spreading can send a value to all other processors across the chips. A single-bit value can be spread from one chip to all other chips along the hypercube wires in only 75 steps. Variants of scans and spreads have been built into the Paris instructions for ease of access.

(d) • In broadcasting, copies of a single item are sent to all processors. In the CM-2, this is carried out through the broadcast bus to all data processors at once.
• Global combining allows the front end to obtain the sum, largest value, logical OR, etc., of values, one from each processor.
• Data-parallel programming provides the high-level programmer with the illusion of as many processors as necessary; one programs as if there were a processor for every data element to be processed. These are often described as virtual processors.
Problem 8.7

(a) The X-Net interconnect directly connects each PE with its eight neighbors in the two-dimensional mesh. Each PE has 4 connections at its diagonal corners, forming an X pattern, similar to the BLITZEN X-grid network (Davis and Reif, 1986). A tri-state node at each X intersection permits communication with any of the 8 neighbors using only 4 wires per PE.

The connections at the PE array edges are wrapped around to form a two-dimensional torus. The torus structure is symmetric, facilitates several important matrix algorithms, and can emulate a one-dimensional ring with two X-Net steps. The aggregate X-Net communication bandwidth is 18 Gbytes/s in the largest MP-1 configuration.

(b) The network provides global communication between all PEs and forms the basis for the MP-1 I/O system. The three router stages implement the function of a 1024 x 1024 crossbar switch. Three router chips are used on each processor board.

Each PE cluster shares an originating port connected to router stage S1 and a target port connected to router stage S3. Connections are established from an originating PE through stages S1, S2, and S3, and then to the target PE. The full MP-1 configuration has 1024 PE clusters, so each stage has 1024 router ports. The router supports up to 1024 simultaneous connections with an aggregate bandwidth of 1.3 Gbytes/s.

(c) 1. Each PE has a 4-bit integer ALU, a 1-bit logic unit, a 64-bit mantissa unit, a 16-bit exponent unit, and a flag unit. All these functional units can be active simultaneously.
2. The PE array communicates with the parallel disk array through the high-speed I/O system, which is essentially implemented by the 1.3 Gbytes/s global router network.
Problem 8.8

(a) A fat tree is more like a real tree in that it gets thicker as one moves from the leaves toward the root. Processing nodes, control processors, and I/O channels are located at the leaves of the fat tree. A binary fat tree is illustrated in Fig. 8.6. The internal nodes are switches. Unlike in an ordinary binary tree, the channel capacities of a fat tree increase as we ascend from the leaves to the root.

Figure 8.6 Binary fat tree.

The hierarchical nature of a fat tree can be exploited to give each user partition a dedicated subtree, which cannot be interfered with by any other partition's message traffic. The CM-5 data network actually implements a 4-ary fat tree, as shown in Fig. 8.7. Each of the internal switch nodes is made up of several router chips. Each router chip is connected to 4 child chips and either 2 or 4 parent chips.

Figure 8.7 CM-5 data network implemented with a 4-ary fat tree. (Courtesy of Leiserson et al., Thinking Machines Corporation, 1992)

To implement the partitions, one can allocate different subtrees to handle different partitions. The size of the subtrees varies with the partition demands. The I/O channels are assigned to another subtree, which is not devoted to any user partition; the I/O subtree is accessed as a shared system resource. In many ways, the data network functions like a hierarchical system bus, except with no interference among the partitioned subtrees. All leaf nodes have unique physical addresses.

(b) The fat tree can be subdivided into several subtrees. Each subtree is assigned to a user partition. Each partition consists of a control processor, a collection of processing nodes, and dedicated portions of the data and control networks.

(c) As shown in Fig. 8.8, the basic control processor consists of a RISC microprocessor (CPU), a memory subsystem, I/O with local disks and Ethernet connections, and a CM-5 network interface. This is equivalent to a standard off-the-shelf workstation-class computer system. The network interface connects the control processor to the rest of the system through the control network and the data network.

Each control processor runs CMOST, a UNIX-based OS with extensions for managing the parallel processing resources of the CM-5. Some control processors are used to manage computational resources in user partitions; others are used to manage I/O resources. Control processors are specialized for managerial functions rather than computational functions. For this reason, high-performance arithmetic accelerators are not needed. Instead, additional I/O connections are more useful in control processors.

(d) As illustrated in Fig. 8.10a, vector units can be added between the memory banks and the system bus as an optional feature. The vector units replace the memory controller in Fig. 8.9. Each vector unit has a dedicated 72-bit path to its attached memory bank, providing a peak memory bandwidth of 128 Mbytes/s per vector unit.

The vector units execute vector instructions issued to them by the scalar microprocessor and perform all functions of a memory controller, including generation and checking of ECC (error-correcting code) bits.
Figure 8.8 The control processor in CM-5. (Courtesy of Thinking Machines Corporation, 1992)

Figure 8.9 The processing node in CM-5. (Courtesy of Thinking Machines Corporation, 1992)

Figure 8.10 The processing node with vector units in CM-5: (a) processing node with vector units; (b) vector unit functional architecture. (Courtesy of Thinking Machines Corporation, 1992)
As detailed in Fig. 8.10b, each vector unit has a vector instruction decoder, a pipelined ALU, and 64 64-bit registers, like a conventional vector processor.

Each vector instruction may be issued to a specific vector unit or pair of units, or broadcast to all four units at once. The scalar microprocessor takes care of address translation and loop control, overlapping them with vector unit operations. Together, the vector units provide 512 Mbytes/s of memory bandwidth and 128 Mflops of 64-bit peak performance per node.

In this sense, each processing node of the CM-5 is itself a supercomputer. Collectively, 16K processing nodes can yield a peak performance of 2^14 x 2^7 = 2^21 Mflops = 2 Tflops.

Initially, SPARC microprocessors are used in implementing the control processors and the processing nodes. As processor technology advances, other new processors may also be incorporated in the future. The network architecture is designed to be independent of the processors chosen, except for the network interfaces, which may need minor modifications when new processors are used.
Problem 8.9

(a) An example of replication: if A and B are arrays and X is a scalar quantity, the statement A = B + X implicitly broadcasts X to all processors so that the value of X can be added to every element of B.

(b) Besides sum-reduction, other important reduction operations include taking the maximum or minimum, logical AND, and logical OR. The following are examples of maximum-reduction and minimum-reduction, applied to the rows of a 4 x 4 array:

Maximum Reduction
1 2 3 4        4
1 0 0 1   ->   1
6 5 6 2        6
4 2 4 5        5

Minimum Reduction
1 2 3 4        1
1 0 0 1   ->   0
6 5 6 2        2
4 2 4 5        2

(c) Transposing a matrix, reversing a vector, shifting a multidimensional grid, and FFT butterfly patterns are all examples of permutation. Here is an example of matrix transposition:

Matrix Transposition
1 2 3 4        1 1 6 4
1 0 0 1   ->   2 0 5 2
6 5 6 2        3 0 6 4
4 2 4 5        4 1 2 5

(d) The following are examples of maximum-prefix and minimum-prefix, computed along each row:

Maximum Prefix
1 2 3 4        1 2 3 4
1 0 0 1   ->   1 1 1 1
6 5 6 2        6 6 6 6
4 2 4 5        4 4 4 5

Minimum Prefix
1 2 3 4        1 1 1 1
1 0 0 1   ->   1 0 0 0
6 5 6 2        6 5 5 2
4 2 4 5        4 2 2 2
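The row-wise reductions and prefix (scan) operations shown above are easy to express in a few lines. The sketch below is an illustration only; it reproduces the example matrix and its reduction, prefix, and transposition results sequentially, whereas a data-parallel machine would process all rows at once.

from itertools import accumulate

M = [
    [1, 2, 3, 4],
    [1, 0, 0, 1],
    [6, 5, 6, 2],
    [4, 2, 4, 5],
]

def reduce_rows(matrix, op):
    """Row-wise reduction, e.g. op=max or op=min."""
    return [op(row) for row in matrix]

def prefix_rows(matrix, op):
    """Row-wise prefix (scan), e.g. running maximum or running minimum."""
    return [list(accumulate(row, op)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

if __name__ == "__main__":
    print("max-reduction:", reduce_rows(M, max))    # [4, 1, 6, 5]
    print("min-reduction:", reduce_rows(M, min))    # [1, 0, 2, 2]
    print("max-prefix   :", prefix_rows(M, max))
    print("min-prefix   :", prefix_rows(M, min))
    print("transpose    :", transpose(M))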
Problem 8.10

(a) Pipeline chaining for CVF execution:

[Figure: functional pipelines chained through vector registers for the CVF.]

(b) Space-time diagram:

[Figure: space-time diagram of the chained pipeline execution.]
Problem 8.11

(a) The 11 vector instructions needed to perform the given CVFs on the Cray X-MP are as follows:

M(B : B+63) -> V1
M(C : C+63) -> V2
s x V2 -> V3
V1 + V3 -> V4
V4 -> M(A : A+63)
s x V1 -> V5
V5 x V2 -> V6
V6 -> M(D : D+63)
V2 - V1 -> V7
V4 x V7 -> V8
V8 -> M(E : E+63)

(b) Space-time diagram for the execution of the CVF code:

[Figure: space-time diagram showing the loads, chained arithmetic operations, and stores on the Cray X-MP pipelines.]

(c) Execution of the CVFs using pipeline chaining on the Cray-1:

[Figure: corresponding space-time diagram for the Cray-1.]

The speedup of the Cray X-MP over the Cray-1, obtained from the ratio of the two execution times, approaches 1.25 for large n.
Problem 8.12

(a) The average execution rate can be computed as

$$R_a = \frac{1}{\alpha/R_v + (1-\alpha)/R_s}.$$

Substituting the given values of R_s and R_v yields

$$R_a = \frac{10}{10 - 9\alpha}\ \text{(Mflops)}.$$

(b) The plot is shown below:

[Plot: R_a versus the vectorization ratio alpha; the rate rises slowly for small alpha and sharply as alpha approaches 1.]

(c) We have

$$\frac{10}{10 - 9\alpha} = 7.5,$$

hence alpha = 26/27 = 0.963.

(d) With the given data, the following equation is obtained:

$$\frac{R_v}{0.3 R_v + 0.7} = 2,$$

which can be solved to give R_v = 3.5 Mflops.
Problem 8.13

(a) The algorithm to compute the expression on a serial computer is shown below:

s = A_1 x B_1
For i = 2 to 32 Do
    s = s + A_i x B_i
Enddo

There are 32 multiply operations and 31 add operations. The number of time units needed is 32 x 4 + 31 x 2 = 190.

(b) The algorithm for the SIMD computer is shown below (the 32 elements are viewed as a 4 x 8 array A_{i,j}, and indices wrap around modulo 8):

Parfor j = 1 to 8 Do
    s(j) = A_{1,j} x B_{1,j}            /* 1 multiply operation */
    For i = 2 to 4 Do
        s(j) = s(j) + A_{i,j} x B_{i,j}  /* 1 multiply and 1 add operation */
    Enddo
    s(j) = s(j) + s(j+1)                /* 1 routing and 1 add operation */
    s(j) = s(j) + s(j+2)                /* 2 routing and 1 add operations */
    s(j) = s(j) + s(j+4)                /* 4 routing and 1 add operations */
Enddo

There are 4 multiply operations, 6 add operations, and 7 routing operations per PE. The time needed is 4 x 4 + 6 x 2 + 7 x 1 = 35 cycles.
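The cycle count can be verified by executing the SIMD algorithm step by step. The sketch below is an illustration only: eight simulated PEs each form four local products and then combine partial sums with routings of 1, 2, and 4 positions, charging the same costs as above (multiply 4, add 2, routing 1 time unit per position).

def simd_inner_product(A, B, n_pe=8):
    """Simulate the 8-PE SIMD evaluation of sum(A[i]*B[i]) for 32 elements."""
    assert len(A) == len(B) == 4 * n_pe
    cycles = 0
    # local phase: PE j accumulates elements j, j+8, j+16, j+24
    s = [A[j] * B[j] for j in range(n_pe)]
    cycles += 4                                    # 1 multiply
    for i in range(1, 4):
        for j in range(n_pe):
            s[j] += A[i * n_pe + j] * B[i * n_pe + j]
        cycles += 4 + 2                            # 1 multiply + 1 add per step
    # combining phase: shift by 1, 2, 4 (wraparound) and add
    for shift in (1, 2, 4):
        s = [s[j] + s[(j + shift) % n_pe] for j in range(n_pe)]
        cycles += shift + 2                        # routing cycles + 1 add
    return s[0], cycles

if __name__ == "__main__":
    A = list(range(1, 33))
    B = [2] * 32
    result, cycles = simd_inner_product(A, B)
    print("result :", result, "(expected", sum(a * b for a, b in zip(A, B)), ")")
    print("cycles :", cycles)                      # 35, versus 190 on the serial machine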
Problem 8.14

(a) A Cray Y-MP C-90 has 16 processors. Each processor has 2 vector pipelines. Each pipeline has a floating-point multiply unit and an add unit which can operate concurrently. Therefore, two floating-point operations can be performed each cycle in a vector pipeline. The total number of operations performed in a cycle is 16 x 2 x 2 = 64. The machine has a cycle time of 4.2 ns. Hence, the peak performance = (64 floating-point operations) / (4.2 ns) = 15.2 Gflops.

(b) An NEC SX-X has 4 processors. Each processor has 4 sets of vector pipelines. Each set has two add/shift and two multiply/logical pipelines. The total number of operations performed in a cycle is 4 x 4 x 2 x 2 = 64. Its cycle time is 2.9 ns. Thus the peak performance = (64 floating-point operations) / (2.9 ns) = 22 Gflops.

(c) Both machines perform 64 floating-point operations per cycle, as explained above.
Problem 8.15

(a) Matrices A and B are both divided into blocks, each of size 8 x 8. Denote the blocks as A_{ij} and B_{ij}, respectively, for 0 <= i, j <= 7. Cannon's algorithm for matrix multiplication is used in this problem. The following tables show the initial distribution of matrices A and B among the PEs. The submatrix blocks are stored in a skewed manner: the diagonal subblocks of A appear in the first column, and those of B appear in the first row.

A00 A01 A02 A03 A04 A05 A06 A07
A11 A12 A13 A14 A15 A16 A17 A10
A22 A23 A24 A25 A26 A27 A20 A21
A33 A34 A35 A36 A37 A30 A31 A32
A44 A45 A46 A47 A40 A41 A42 A43
A55 A56 A57 A50 A51 A52 A53 A54
A66 A67 A60 A61 A62 A63 A64 A65
A77 A70 A71 A72 A73 A74 A75 A76

B00 B11 B22 B33 B44 B55 B66 B77
B10 B21 B32 B43 B54 B65 B76 B07
B20 B31 B42 B53 B64 B75 B06 B17
B30 B41 B52 B63 B74 B05 B16 B27
B40 B51 B62 B73 B04 B15 B26 B37
B50 B61 B72 B03 B14 B25 B36 B47
B60 B71 B02 B13 B24 B35 B46 B57
B70 B01 B12 B23 B34 B45 B56 B67

Blocks of C are stored in the natural order in the PEs, as shown below:

C00 C01 C02 C03 C04 C05 C06 C07
C10 C11 C12 C13 C14 C15 C16 C17
C20 C21 C22 C23 C24 C25 C26 C27
C30 C31 C32 C33 C34 C35 C36 C37
C40 C41 C42 C43 C44 C45 C46 C47
C50 C51 C52 C53 C54 C55 C56 C57
C60 C61 C62 C63 C64 C65 C66 C67
C70 C71 C72 C73 C74 C75 C76 C77
(b) The overall algorithm is specified as follows for each PE:

For i = 0 to 7 Do
    Compute the product of the block submatrices of A and B residing in the PE
        and add the product to the resident block of matrix C.
    Pass the block submatrix of A to the left neighbor in a wraparound
        fashion using shift operations.
    Pass the block submatrix of B to the upper neighbor in a wraparound
        fashion using shift operations.
Enddo

Basically, in step 1 of the ith iteration, PE_{xy} performs the following computation:

$$C_{xy} = C_{xy} + A_{x,(j+i)\bmod 8}\; B_{(j+i)\bmod 8,\,y},$$

where j is the initial column index of the block submatrix of A residing in PE_{xy}. It is straightforward to specify the detailed operations for the multiplication of two submatrix blocks in each PE.

Steps 2 and 3 exchange matrix elements among the PEs. In the last iteration, they bring the individual submatrices back to the PEs in which they initially resided. Note that all the PEs perform identical operations on different data, in keeping with the SIMD mode of operation.
(c) The multiplication of 8 x 8 matrix blocks in each PE and the accumulation into C take 8^3 multiplication and 8^3 addition operations. Steps 2 and 3 require 8^2 shift operations each. Therefore, the number of cycles needed in each iteration is 2 x 8^3 + 2 x 8^2 = 1152. The total number of cycles in 8 iterations is 9216. If the shift operations of the last iteration are omitted, 128 cycles can be saved.

(d) If data duplication is allowed, each block submatrix of A is duplicated along its row and each block submatrix of B is duplicated along its column by the following instructions:

For i = 0 to 7 Do
    PEs in column i broadcast their submatrices of A
        to the other PEs in the same row.
    PEs in row i broadcast their submatrices of B
        to the other PEs in the same column.
Enddo

Now each PE has all the elements needed to compute a subblock of the C matrix, and no further data movement is required. So the last step is for all PEs to compute the submatrix blocks of C simultaneously. The arithmetic operations are identical to those in (b). Possible savings in execution time come from the reduction in communication overhead, if the broadcast operations can be carried out efficiently.
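The skewed data movement of parts (a) and (b) is easy to check on small matrices. The sketch below is an illustration only: it uses one element per PE (rather than the 8 x 8 blocks of the problem) and plain Python lists, but the initial skewing and the left/up wraparound shifts mirror the algorithm above.

def cannon_matmul(A, B):
    """Cannon's algorithm on an n x n grid, one element per 'PE'."""
    n = len(A)
    # initial alignment: skew row i of A left by i, column j of B up by j
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]      # local multiply-accumulate
        a = [row[1:] + row[:1] for row in a]      # shift A one step left (wraparound)
        b = b[1:] + b[:1]                         # shift B one step up (wraparound)
    return C

if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    C = cannon_matmul(A, B)
    expected = [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]
    print(C == expected, C)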
Problem 8.16 The comparison of the CM-2 and CM-5 is summarized in the table below; a more detailed comparison can be found in the relevant manuals.

Machine | Architecture                         | Operation Mode           | Potential Performance | Improvement
CM-2    | 64K bit-slice processors, hypercube  | SIMD                     | 10 Gflops             |
CM-5    | 16K SPARCs, 4-ary fat tree           | SIMD, MSIMD, Sync. MIMD  | 2 Tflops              | mixture of parallel techniques
Problem 8.17 The linear combination can be written as

$$(y_0, y_1, \ldots, y_{1023}) = a_0\mathbf{x}_0 + a_1\mathbf{x}_1 + \cdots + a_{1023}\mathbf{x}_{1023}$$
$$= a_0(x_{0,0}, x_{1,0}, \ldots, x_{1023,0}) + a_1(x_{0,1}, x_{1,1}, \ldots, x_{1023,1}) + \cdots + a_{1023}(x_{0,1023}, x_{1,1023}, \ldots, x_{1023,1023})$$
$$= (a_0 x_{0,0} + a_1 x_{0,1} + \cdots + a_{1023} x_{0,1023},\ \ldots,\ a_0 x_{1023,0} + a_1 x_{1023,1} + \cdots + a_{1023} x_{1023,1023}).$$

Thus, we have the following equalities:

$$y_i = \sum_{j=0}^{1023} a_j x_{i,j}, \qquad i = 0, 1, \ldots, 1023. \tag{8.1}$$
(a) From Eq. (8.1) above, we see that each element of y can be computed separately. Thus, each processor can be used to carry out one-fourth of the computations: processor l (l = 0, 1, 2, 3) computes elements l x 256 through (l + 1) x 256 - 1 of y. Vector a is replicated in all processors. The multiplier and adder in each processor are chained as shown in the following diagram:

[Figure: the multiply pipeline chained into the add pipeline, with the adder output fed back to one adder input through the 4-element buffer C.]

Without loss of generality, consider the operations performed by processor 0. In each processor, two auxiliary vectors are used: C is a vector of 4 elements which are initialized to 0, and D(0 : 1023) is used to store intermediate results. Let us examine the computation of y_0. The computations are divided into two phases.

In cycle 0, a_0 and x_{0,0} are fed into the multiplier. After 4 cycles, their product appears at one input of the adder. After four more cycles, the value a_0 x_{0,0} appears at the output and is routed back to the input port for C (see the diagram). Thereafter, one more product term is added to C_0 every four cycles. A similar situation holds for the other elements of C. For a description of pipeline chaining for this purpose, see [Hwang84], pp. 279-280.

After all the product terms for y_0 have been accumulated in the adder, the elements of vector C have the following values:

$$C_k = \sum_{j=0}^{255} a_{4j+k}\, x_{0,4j+k}, \qquad k = 0, 1, 2, 3.$$

Just prior to the arrival of the product terms for the next element of y at the adder, C(0 : 3) are stored into the corresponding segment D(4i : 4i + 3) one by one and are reset to zero.

This process is repeated for successive elements of y. In this way, pairs of elements of a and x can be fed continuously into the multiplier in each cycle. Thus, at the end of the first phase, D holds 256 "segments" of 4 elements each. This phase takes 256 x 1024 + 8 - 1 = 262,151 clock cycles.

In the second phase, each segment in D is summed to obtain one element of y. This can be done by first generating 256 pairs of partial sums (two elements in each segment are added), and then adding each pair of partial sums to produce the final result. In the optimal case, the first four add operations can be overlapped with the last four add operations of the first phase. Therefore, the total number of cycles needed for phase 2 is 512 + 256 = 768. Consequently, the total number of cycles for the multivector system is 262,151 + 768 = 262,919.

Note that if the two phases of computation are interspersed, the vector D is not needed, but the timing is not optimal.

(b) On a single processor without vector processing capability, the number of operations is 1024 x 1024 multiplications and 1024 x 1023 additions. Each operation takes 4 cycles, giving a total of 8,384,512 cycles. Therefore, the speedup of the multivector system over the single scalar processor is 8,384,512 / 262,919 = 31.89, which is close to the theoretical maximum value of 32.

In the above analysis, pipeline startup time has been neglected and a very intelligent scheduler is assumed. Actual performance may be poorer when various overheads are taken into account.
Problem 8.18 Suppose low-order interleaving is used so that consecutive elements of a vector are stored in consecutive memory modules. Without loss of generality, assume that the first element (element 0) of the vector is stored in memory module 0. Let s be the stride of a vector access, and let n_1 = m_1 s and n_2 = m_2 s be the indices of two different elements retrieved; assume n_1 > n_2. The memory modules in which the two elements reside are n_1 mod 17 and n_2 mod 17, respectively. Now

$$(n_1 \bmod 17) - (n_2 \bmod 17) \equiv (n_1 - n_2) \bmod 17 = (s m_1 - s m_2) \bmod 17$$
$$= \begin{cases} 0, & \text{if } (m_1 - m_2) \bmod 17 = 0 \\ \big((s \bmod 17)\,((m_1 - m_2) \bmod 17)\big) \bmod 17 \neq 0, & \text{if } (m_1 - m_2) \bmod 17 \neq 0 \end{cases}$$

The second result follows from the given condition s mod 17 != 0. Therefore, if (m_1 - m_2) mod 17 != 0 for every pair m_1 and m_2, there will be no conflicts in the memory accesses. Normally, the elements are accessed in increasing order and at most 17 elements are accessed at a time, so (m_1 - m_2) mod 17 != 0, which ensures conflict-free accesses.
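The argument can be checked numerically. The short sketch below is an illustration only: it lists the modules touched by 17 consecutive accesses of a given stride and reports whether any module repeats; every stride that is not a multiple of 17 is conflict-free.

def modules_touched(stride, count=17, modules=17):
    """Module indices hit by `count` consecutive vector elements of the given stride."""
    return [(k * stride) % modules for k in range(count)]

def conflict_free(stride, count=17, modules=17):
    hit = modules_touched(stride, count, modules)
    return len(set(hit)) == len(hit)           # no module visited twice

if __name__ == "__main__":
    for s in (1, 2, 5, 16, 17, 34, 35):
        print(f"stride {s:2d}: conflict-free = {conflict_free(s)}")
    # Only strides that are multiples of 17 (17, 34) cause repeated modules.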
Chapter 9

Scalable, Multithreaded, and Dataflow Architectures
Problem 9.1

(a) The efficiency can be computed as

$$E = \frac{1/R}{1/R + L} = \frac{1}{1 + RL}.$$

(b) The new rate of remote memory requests is R' = (1 - h)R. Hence,

$$E = \frac{1}{1 + R'L} = \frac{1}{1 + (1-h)RL}.$$

(c) The number of threads needed to saturate the processor is

$$N_d = \frac{1/R' + C + L}{1/R' + C} = 1 + \frac{L}{1/R' + C}.$$

If N >= N_d, the latency is completely hidden and

$$E_N = \frac{1/R'}{1/R' + C} = \frac{1}{1 + (1-h)CR}.$$

If N < N_d,

$$E_N = \frac{N/R'}{1/R' + C + L} = \frac{N}{1 + (1-h)R(L + C)}.$$
(d) To compute L, we need the mean internode distance D. Let P_i be the probability that a node sends a message to a node at distance i. In reference to Problem 2.11, D can be computed by summing i P_i over all possible distances; for the network considered here the result is D = r. The remote-access latency L in the expressions of part (c) is then replaced by the latency L' implied by this mean distance, giving

$$E_N = \frac{1/R'}{1/R' + C} = \frac{1}{1 + (1-h)CR} \qquad \text{for } N \ge N_d,$$

$$E_N = \frac{N}{1 + (1-h)R(L' + C)} \qquad \text{for } N < N_d.$$
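For quick experimentation, the saturation model of part (c) can be evaluated numerically. The sketch below is an illustration only, assuming the expressions given above; the parameter values are arbitrary illustrative choices, not data from the problem.

def efficiency(N, R, L, C, h=0.0):
    """Multithreaded processor efficiency with N resident threads.
    R : rate of remote requests per cycle (1/R is the mean run length)
    L : remote access latency, C : context switch overhead, h : hit ratio."""
    Rp = (1.0 - h) * R                     # effective remote-request rate
    run = 1.0 / Rp                         # mean run length between remote requests
    N_d = (run + C + L) / (run + C)        # threads needed to saturate the processor
    if N >= N_d:
        return run / (run + C)             # latency fully hidden
    return N * run / (run + C + L)         # linear region

if __name__ == "__main__":
    R, L, C, h = 0.02, 100, 10, 0.5        # illustrative values only
    for N in range(1, 6):
        print(N, round(efficiency(N, R, L, C, h), 3))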
Problem 9.2 The architectural assumptions and notation used in this problem are similar to those in [Saavedra90]. A deterministic model is adopted in the analysis. Summarized below are the basic system parameters to be used:

• N: the number of threads or contexts that can be supported simultaneously in each processor.
• C: the context-switching overhead, which accounts for the cycles lost in performing a context switch in a processor.
• L: the communication latency for a processor to access a remote memory through the network.
• R: the run length of a single thread before it issues a memory request or is switched out. Note that the definition here is the inverse of the definition of R in Problem 9.1.
• f: the coverage factor for prefetching, defined as the percentage of memory requests successfully prefetched to satisfy the demand of a thread.
• E: the processor efficiency, defined as the percentage of time a processor is actively executing a thread.

(a) Effectively, prefetching reduces the memory latency from L to L' (L' < L). If a memory request has been prefetched, the time spent on the request equals V + L', where V is the overhead for prefetching, which includes the effect of the extra instructions inserted to perform prefetching. (Assume a software prefetching technique is used.) The processor efficiency E of a single-threaded processor with prefetching can be expressed as

$$E = \frac{R}{R + f(V + L') + (1 - f)L}. \tag{9.1}$$

The latency for a remote access is thus reduced from L to f(V + L') + (1 - f)L.

(b) Based on the reduced latency, two different regions in the efficiency curve can be identified:

$$E = \frac{R}{R + C}, \qquad \text{if } N \ge N_d = \frac{f(V + L') + (1 - f)L}{R + C} + 1,$$

$$E = \frac{NR}{R + C + f(V + L') + (1 - f)L}, \qquad \text{if } N < N_d.$$
Problem 9.3 For this problem, the same parameters as defined in Problem 9.2 are used.

(a) The major benefit of release consistency lies in allowing read requests to bypass outstanding write requests and allowing write requests to be pipelined. Therefore, the processor stalls only for read requests or when the write buffer is full. The probability of a write buffer being full is usually very low if the write buffer is large enough in capacity.

Let w be the probability of a request being a write. The processor efficiency with release consistency alone can be expressed as

    E = R / (R + (1 - w)L + w·b),        (9.2)

where b is a parameter that depends on the buffer capacity, the network delay, w, and the rate at which remote memory accesses are requested by each processor.

(b) With the release consistency model, the number of threads needed to completely hide the latency is

    N_d = [(1 - w)L + w·b] / (R + C) + 1.

Thus, the efficiency is

    E_sat = R / (R + C),                     if N ≥ N_d;
    E_lin = N·R / [(1 - w)L + w·b + R + C],  if N < N_d.

(c) If prefetching and release consistency are both employed, the latency will be further reduced. Combining the results in Eqs. 9.1 and 9.2 above, we obtain the following expression for the effective latency in a single-threaded processor:

    L_eff = f(V + L') + (1 - f)[(1 - w)L + w·b].

Thus, the efficiency of a single-threaded processor is

    E = R / (R + L_eff).

Based on L_eff, we can determine the number of threads needed to fully hide the latency in a multithreaded processor as

    N_d = {f(V + L') + (1 - f)[(1 - w)L + w·b]} / (R + C) + 1.

Hence, the efficiency of a multithreaded processor is

    E_sat = R / (R + C),                                           if N ≥ N_d;
    E_lin = N·R / {f(V + L') + (1 - f)[(1 - w)L + w·b] + R + C},   if N < N_d.
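The closed forms above can be folded into a single routine. The following C sketch follows the formulas of parts (b) and (c); all numeric values in main() are made up for illustration, and setting f = 0 recovers the release-consistency-only case of part (b):

    #include <stdio.h>

    /* Efficiency with software prefetching (coverage f, overhead V,
       reduced latency Lp) combined with release consistency (write
       fraction w, write-buffer parameter b); R is the run length,
       C the context-switch overhead, and N the number of threads.    */
    double efficiency(double N, double R, double C, double L,
                      double f, double V, double Lp, double w, double b) {
        double Leff = f * (V + Lp) + (1.0 - f) * ((1.0 - w) * L + w * b);
        double Nd   = Leff / (R + C) + 1.0;   /* threads needed to hide Leff */
        if (N >= Nd)
            return R / (R + C);               /* saturated region            */
        return N * R / (Leff + R + C);        /* linear region               */
    }

    int main(void) {
        double R = 20, C = 2, L = 120, f = 0.6, V = 4, Lp = 10, w = 0.3, b = 8;
        for (int N = 1; N <= 6; N++)
            printf("N = %d  E = %.3f\n", N,
                   efficiency(N, R, C, L, f, V, Lp, w, b));
        return 0;
    }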
Problem 9.4

(a) We know m processors are attached to each column bus, since there are m row buses in the system. Each generates r requests per second on the bus. Thus, the total request rate per bus is mr. Suppose each request consists of t bits, assuming a uniform length for all requests. (Alternatively, t can be taken as the average length of each request.) Then the following relation holds, where a is the bus utilization:

    m·r·t = B·a.

Therefore, the memory bandwidth per bus is

    B = m·r·t / a.

(b) Assume all the buses (row or column) have the same bandwidth. There are 2m buses in the system. Hence, the total bus bandwidth is 2mB = 2m²rt/a.

(c) There are m² processors in the system, each generating r requests per second. If each request uses only a row bus or a column bus, then the total bandwidth requirement is m²rt. This has to be satisfied by the available memory bandwidth, which is 2mB. Therefore

    m²·r·t ≤ 2mB.

Hence,

    r ≤ 2B / (m·t).

(d) If all the processors send requests that need to go through two buses (one column bus and one row bus), then at a certain instant of time there would be m²r + m²r = 2m²r requests that need to be serviced by the bus system. Therefore, the total bus bandwidth needed is 2m²rt.

(e) The bus bandwidth of the multicube system is designed to allow a bus utilization rate of at most 1 (i.e., 0 < a ≤ 1). In (d), the bus bandwidth requirement represents the maximum bandwidth demand. From the relation

    2m²·r·t ≤ 2m²·r·t / a,

it is concluded that the available bus bandwidth provided by the multicube is adequate.
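The relations in parts (a) through (e) reduce to a few products, as the short C sketch below shows; the values chosen for m, r, t, and a are arbitrary examples:

    #include <stdio.h>

    int main(void) {
        double m = 16;      /* m x m processors, m row and m column buses */
        double r = 1.0e5;   /* requests per second per processor          */
        double t = 256;     /* bits per request                           */
        double a = 0.8;     /* maximum allowed bus utilization            */

        double B      = m * r * t / a;      /* per-bus bandwidth, part (a)   */
        double total  = 2 * m * B;          /* total bus bandwidth, part (b) */
        double demand = 2 * m * m * r * t;  /* worst-case demand, part (d)   */

        printf("per-bus bandwidth B       = %.3e bits/s\n", B);
        printf("total bus bandwidth 2mB   = %.3e bits/s\n", total);
        printf("worst-case demand 2m^2rt  = %.3e bits/s\n", demand);
        printf("adequate: %s\n", demand <= total ? "yes" : "no");
        return 0;
    }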
Problem 9.5

(a) See Fig. 4 in [Hwang91].

(b) Because of the column and row access modes available on the OMP, special instructions are needed. Please refer to the original paper [Hwang91] for a description of the instructions, the data distribution, and the SPMD program for performing matrix multiplication.
(c) The number of orthogonal memory accesses is 2N³/n² + 2N²/n + N²/n². The number of synchronizations is 2N²/n². For details, see the proof of Lemma 1 in [Hwang91].

(d) Two-dimensional FFT requires 4N²/n² orthogonal memory accesses and one synchronization. For a description of the SPMD program and the complexity analysis, see [Hwang91].
Problem 9.6

(a) The SVM retains the programming paradigm of a tightly coupled shared-memory multiprocessor, which directly supports data sharing among processes. This promotes portability of programs across systems. In addition, it has the advantages of a distributed-memory machine. The large virtual address space allows programs to be much larger in code and data space than the physical memory on an individual node. Moreover, remote memory can be used as an added level of the memory hierarchy between the local memory and the disks to improve performance. Thus, SVM provides such desirable properties as low cost and scalability by getting rid of hardware bottlenecks.

(b) In SVM systems implemented by the OS (such as IVY), it is convenient to use the underlying virtual memory page size as the unit of sharing among processes. In hardware-implemented SVM systems (such as Dash), the unit of sharing is usually much smaller, typically the size of a cache block. Some of the differences are listed below:

• Page-level sharing is more effective for exploiting locality of reference in shared-memory processes. But it is also more susceptible to contention among processes (more than one process trying to access the same page).
• Page-level sharing is more likely to cause false sharing; that is, two processes may access completely different parts of the same page.
• The size of the directory is much larger if the unit of sharing is cache blocks instead of pages; the storage demand of the directory information can be excessive.
Problem 9.7

(a) • To implement RC, one needs two memory instructions (load-lock and store-unlock), a lockup-free cache, and some kind of scoreboarding to keep track of outstanding requests.
• To implement PC, one needs a multiport memory to allow processors to perform out-of-order writes.
• To implement WC, one needs store buffers in each processor with some matching hardware to bypass loads.
(b) Different consistency models impose different constraints on the order of shared-memory accesses by each process. The following diagram, adapted from [Gharachorloo91], illustrates the event ordering according to the PC, WC, and RC models. In the figure, L stands for load, S for store, A for acquire, and R for release; an arrow means program order has to be observed; loads and stores in the same block can be executed in any order provided dependence relations are respected. Subscripts to acquire and release operations stand for synchronization variables or memory locations.

(Figure omitted: permissible orderings of load, store, acquire, and release events under the PC, WC, and RC models.)
Some of the advantages and shortcomings of each model are summarized below. See, for instance, [Gharachorloo91, Mosberger93] for further discussions.

• Advantages of PC: Loads are allowed to overtake store accesses by the same processor if the accesses are to different locations. If a load and a store are to the same memory location, the load can be satisfied by the store operation, as in the TSO or PSO model. Thus, a load never stalls for pending stores.
• Shortcomings of PC: Store operations in each processor have to follow program order, making the chance of a write buffer becoming full higher, which means the processor is more likely to be stalled.
• Advantages of WC: WC ensures sequential consistency only at synchronization points. Load/store operations between synchronization points can be performed in any order as long as control and data dependences are not violated in each processor. Hence, store operations can be pipelined, leading to improved performance.
• Shortcomings of WC: The processor is stalled at an acquire operation, waiting for previous stores and releases to complete. It is also stalled at the first load following a release operation. As a result, in fine-grain computations with frequent synchronizations, WC can perform poorly compared to PC.
• Advantages of RC: The shortcoming of WC for fine-grain computations is eliminated, since RC does not block a processor at a load/acquire waiting for previous store/release operations to complete. Independent synchronizations do not need to wait for the completion of each other, as shown in the diagram. Therefore, a higher degree of parallelism can be realized.
• Shortcomings of RC: RC requires more complex hardware/software support for implementation (see (a)). Special language constructs and compiler support are needed to properly label a program and generate the code for execution in this model.
Problem 9.8

(a)
• It provides the communication, synchronization, and global naming mechanisms required to efficiently support fine-grain, concurrent programming models.
• It extends a conventional microprocessor instruction set architecture with instructions to support parallel processing.
• It provides hardware support for end-to-end message delivery including formatting, injection, delivery, buffer allocation, buffering, and task scheduling.
• It supports a broad range of parallel programming models, including shared-memory, data-parallel, dataflow, actor, and explicit message-passing, by providing low-overhead primitive mechanisms for communication, synchronization, and naming. Its communication mechanisms permit a user-level task on one node to send a message to any other node in a 4096-node machine in less than 2 μs.

(b) All messages route first in the X dimension, then Y, then Z.

(c) The AAU performs all functions associated with memory addressing. It contains the address and ID registers to support naming and relocation. It protects memory accesses and implements the translation instructions. It maintains two queues to buffer incoming messages and schedule the associated tasks.

(d) See Example 9.4.
Problem 9.9

(a) In a VLSI implementation, networks with many dimensions require more and longer wires than low-dimensional networks. Thus, high-dimensional networks cost more and run more slowly than low-dimensional networks. Under the assumption of constant wire bisection, low-dimensional networks have wide channels, and high-dimensional networks have narrow channels. With the wormhole routing method, which is used by most of the second- and third-generation multicomputers, the wider channels provide a lower latency, less contention, and higher hot-spot throughput.

(b) We can treat the router at each node as a stage, and the flit buffer as the stage latch, in a superpipelined functional unit. Information is transmitted (processed) from one router (stage) to another. The differences are:
• Most pipelined functional units are synchronously operated.
• Pipelined functional units have fixed data flow patterns, but the message-passing mechanism may dynamically change its data flow according to the routing information in the header flits.
Problem 9.10

(a) • The memory is initially in the home state (uncached), and all cache copies are invalid. Sharing-list creation begins at the cache where an entry is changed from an invalid to a pending state. When a read-cache transaction is directed from a processor to the memory controller, the memory state is changed from uncached to cached and the requested data is returned. The requester's cache entry state is then changed from the pending state to an only-clean state. Sharing-list creation is illustrated in the figure below. Multiple requests can be simultaneously generated, but they are processed sequentially by the memory controller.

(Figure omitted: before and after the read-cached transaction, the requesting processor's cache entry goes from pending to only-clean and the memory state goes from home/uncached to cached.)
• For a subsequent memory access, the memory state is cached, and the cache at the head of the sharing list has possibly dirty data. As illustrated in the figure below, a new requester (cache A) first directs its read-cache transaction to memory but receives a pointer to cache B instead of the requested data.

(Figure omitted: cache A's read-cache transaction to memory returns a pointer to the old head, cache B.)

A second cache-to-cache transaction, called prepend, is directed from cache A to cache B. Cache B then sets its backward pointer to point to cache A and returns the requested data. The dashed lines correspond to transactions between a processor and memory or another processor. The solid lines are sharing-list pointers. After the transaction, the inserted cache A becomes the new head, and the old head (cache B) follows cache A in the chain.
(b) Compared to a backplane bus, a chained directory provides greater bandwidth and better scalability. Its cost can be lower since snoopy cache controllers are not needed. It allows an invalidation signal to be sent to specific processors instead of broadcasting the signal to all processors. However, it may take a longer time for the signal to reach all the processors involved.

The advantage of a chained directory compared to a full-map directory is the saving in space needed to store directory information. Suppose there are P processors in the system and the number of memory blocks is M. Typically M is proportional to P. If a full-map directory is used, a presence bit is needed to indicate whether a processor has a particular memory block in its cache. The total number of presence bits is O(MP) = O(P²). On the other hand, if a chained directory is used, each block only needs to maintain a pointer to the first processor that caches the block. Each pointer takes O(log P) bits, so a total of O(M log P) = O(P log P) bits is needed. This saving also makes SCI more scalable than a full-map directory.

Compared to a full-map directory, a chained directory has two disadvantages. First, the time it takes to send an invalidation signal to all processors that have a cache copy of a memory block may be longer when the number of processors is large. The reason is that with a full map, the invalidation signal can be sent to all such processors in parallel, whereas with a chained directory, the invalidation is propagated through the chain, which can take a long time. Second, the protocol design may be more complicated. Because of the longer delay, race conditions are more likely to arise, which have to be taken into account in the protocol design.
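The space argument can be made concrete with a short calculation. The C sketch below compares the directory storage of a full-map scheme with that of a chained (SCI-style) directory; the processor count, memory size, and block size are arbitrary example values, and the per-cache forward/backward pointers of SCI are ignored:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double P     = 256;                 /* processors                   */
        double block = 64;                  /* bytes per memory block       */
        double mem   = P * 16.0e6;          /* total memory: 16 MB per node */
        double M     = mem / block;         /* number of memory blocks      */

        double fullmap = M * P;             /* one presence bit/processor   */
        double chained = M * ceil(log2(P)); /* one head pointer per block   */

        printf("memory blocks M     = %.0f\n", M);
        printf("full-map directory  = %.3e bits  (O(MP))\n", fullmap);
        printf("chained directory   = %.3e bits  (O(M log P))\n", chained);
        printf("full-map / chained  = %.1f\n", fullmap / chained);
        return 0;
    }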
Problem 9.11 Different context-switch policies affect the average busy time R.

(a) In the switch-on-cache-miss policy, a context switch takes place only when a memory access with long latency is involved. Thus, context switching makes good use of the idle time. The overhead is the time taken to determine whether a cache hit or miss has occurred. If the switch-on-load scheme is used, the aforementioned overhead is eliminated, but R is likely to be smaller than with switch on cache miss.

Switch on every instruction interleaves instructions from different contexts on a cycle-by-cycle basis, irrespective of whether a load operation is encountered. The independence among successive instructions can hide pipeline dependences, hence improving pipelined execution efficiency. On the other hand, locality may be jeopardized, which results in a lower cache hit ratio. A scheme which interleaves contexts in blocks of instructions improves the locality of references, but the degree of dependence among successive instructions is higher than that in the switch-on-every-instruction scheme. The determination of a suitable block size can be difficult.

(b) Each context-switch scheme has its merits and drawbacks. Thus, more research is needed to determine which one provides the best performance. The choice will depend on other performance parameters as well. For instance, context-switch cost and memory access latency are likely to influence the decision. The behavior of programs should also be taken into account. Both analytical modeling and simulation will be useful in assessing the performance of different schemes.
Problem 9.12 Dash uses a distributed shared-memory architecture which combines the ease of use of shared memory and the scalability of message-passing systems.

(a) Dash uses an invalidation-based cache coherence protocol. See Fig. 7.15 in the text for the cache states and the events causing transitions from one state to another.

(b) A home cluster maintains the directory and physical memory location of a memory address. Each entry in the directory corresponds to a memory block. It has a presence bit for each processor cache. In addition, a state bit indicates whether the block is uncached, shared, or dirty.

A memory access is satisfied by going through the hierarchy of processor cache, local cluster, home cluster, and finally remote clusters. The directory information makes it possible to send invalidation signals to those processors which have a copy of a memory block instead of broadcasting to all processors. It also helps decide when a memory block needs to be written back to main memory.

(c) See Example 9.5 in the text.

(d) See Example 9.5 in the text.
Problem 9.13

(a) The KSR-1 offers a single-level memory, called ALLCACHE. The ALLCACHE design represents the confluence of cache and shared virtual memory concepts that exploit the locality required by scalable distributed computing. Each local cache has a capacity of 32 Mbytes (2^25 bytes). The global virtual address space has 2^40 bytes.

(b) With ALLCACHE, an address becomes a name, and this name automatically migrates throughout the system and is associated with a processor in a cache-like fashion as needed. Copies of a given cell are made by the hardware and sent to other nodes to reduce access time. A processor can prefetch data into a local cache and poststore data for other cells. The hardware is designed to exploit spatial and temporal locality. When a processor writes to an address, all cells are updated and memory coherence is maintained.

(c) Both systems have distributed main memory, scalable interconnection networks, and directory-based coherence schemes. Dash allows pages to be migrated among processors. DDM has a COMA architecture, which replaces the private memory attached to each node by a huge secondary/tertiary cache, called attraction memory. Data blocks can be migrated or duplicated among processors. Processing nodes in both Dash and DDM are clusters of multiple processors. Dash uses a wormhole-routed mesh interconnect, whereas DDM uses a hierarchy of buses. Refer to the papers for more details.
Problem 9.14

(a) Some of the design goals of the Tera architecture are listed below:
• Very high-speed implementations — The architecture should have a short clock period and be scalable to many processors.
• Applicability to a wide spectrum of problems — Programs that do not vectorize well, due to a preponderance of scalar operations or too-frequent conditional branches, should execute efficiently as long as there is sufficient parallelism to keep the processors busy.
• Ease of compiler implementation — The design of the instruction set should simplify the task of the compiler in generating code that can exploit parallelism efficiently.

(b) The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 x 16 x 16 toroidal mesh; i.e., the mesh "wraps around" in all three dimensions. Of the 4096 nodes, 1280 are attached to the resources comprising 256 cache units and 256 I/O processors. The 2816 remaining nodes do not have resources attached but still provide message bandwidth.

To increase node performance, some of the links are omitted. If the three directions are named x, y, and z, then x-links and y-links are missing on alternate z-layers. This reduces the node degree from 6 to 4, or from 7 to 5 counting the resource link. In spite of its missing links, the bandwidth of the network is very large.
Stream Status Word (SSW):
• 32-bit PC (program counter)
• Modes (e.g., rounding, lookahead disable)
• Trap disable mask (e.g., data alignment, overflow)
• Condition codes (last four emitted)
There are no synchronization bits on R0-R31. The target registers (T0-T7) look like SSWs.
(c) Each processor in a Tera computer can execute multiple instruction streams (threads) simultaneously. In the current implementation, as few as 1 or as many as 128 program counters may be active at once. On every tick of the clock, the processor logic selects a thread that is ready to execute and allows it to issue its next instruction. Since instruction interpretation is completely pipelined by the processor, and by the network and memories as well, a new instruction from a different thread may be issued in each tick without interfering with its predecessors.

When an instruction finishes, the thread to which it belongs becomes ready to execute the next instruction. As long as there are enough threads in the processor so that the average instruction latency is filled with instructions from other threads, the processor is fully utilized. Thus, it is only necessary to have enough threads to hide the expected latency (perhaps 70 ticks on average); once the latency is hidden, the processor is running at peak performance and additional threads do not speed the result.

If a thread were not allowed to issue its next instruction until the previous instruction completed, then approximately 70 different threads would be required on each processor to hide the expected latency. The lookahead described later allows threads to issue multiple instructions in parallel, thereby reducing the number of threads needed to achieve peak performance.
(d) Each thread has the following state associated with it:
• one 64-bit stream status word (SSW);
• thirty-two 64-bit general-purpose registers (R0-R31);
• eight 64-bit target registers (T0-T7).

Context switching is so rapid that the processor has no time to swap the processor-resident thread state. Instead, it has 128 of everything, i.e., 128 SSWs, 4096 general-purpose registers, and 1024 target registers. It is appropriate to compare these registers, in both quantity and function, to vector registers or words of cache in other architectures. In all three cases, the objective is to improve locality and avoid reloading data.

Program addresses are 32 bits in length. Each thread's current program counter is located in the lower half of its SSW. The upper half describes various modes (e.g., floating-point rounding, lookahead disable), the trap disable mask (e.g., data alignment, floating overflow), and the four most recently generated condition codes.

Most operations have a TEST variant which emits a condition code, and branch operations can examine any subset of the last four condition codes emitted and branch appropriately. Also associated with each thread are thirty-two 64-bit general-purpose registers. Register R0 is special in that it reads as 0 and output to it is discarded. Otherwise, all general-purpose registers are identical.

The target registers are used as branch targets. The format of the target registers is identical to that of the SSW, though most control transfer operations use only the low 32 bits to determine a new PC. Separating the determination of the branch target address from the decision to branch allows the hardware to prefetch instructions at the branch targets, thus avoiding delay when the branch decision is made. Using target registers also makes branch operations smaller, resulting in tighter loops. There are also skip operations which obviate the need to set targets for short forward branches.

One target register (T0) points to the trap handler, which is nominally an unprivileged program. When a trap occurs, the effect is as if a coroutine call to T0 had been executed. This makes trap handling extremely lightweight and independent of the operating system. Trap handlers can be changed by the user to achieve specific trap capabilities and priorities without loss of efficiency.
(e) The Tera architecture uses a new technique called explicit-dependence lookahead. Each instruction contains a 3-bit lookahead field that explicitly specifies how many instructions from this thread will be issued before encountering an instruction that depends on the current one. Since seven is the maximum possible lookahead value, at most 8 instructions and 24 operations can be concurrently executing from each thread.

A thread is ready to issue a new instruction when all instructions with lookahead values referring to the new instruction have completed. Thus, if each thread maintains a lookahead of seven, then nine threads are needed to hide 72 ticks of latency.
(f) The Tera uses multiple contexts to hide latency. The machine performs a context switch every clock cycle. Both pipeline latency (eight cycles) and memory latency are hidden in the HEP/Tera approach. The major focus is on latency tolerance rather than latency reduction.

With 128 contexts per processor, a large number (2K) of registers must be shared finely between threads. Thread creation must be very cheap (a few clock cycles). Tagged memory and registers with full/empty bits are used for synchronization. As long as there is plenty of parallelism in user programs to hide latency, and plenty of compiler support, the performance is potentially very high.

However, these Tera advantages may be offset by a number of potential drawbacks. The performance can be bad for limited parallelism, as in the case of single-context environments. On the other hand, a large number of contexts (threads) requires lots of registers and other hardware resources, which in turn implies higher cost and complexity. Finally, the limited focus on latency reduction and caching entails a high degree of parallelism and a high memory bandwidth in order to hide latency; both tend to drive up the cost of building the machine.
Problem 9.15

(a) Static dataflow computers do not allow more than one token to reside on the same arc of a dataflow graph. The firing rule for an operator node is that all the input tokens are present and there is no token on the output arc(s). The implementation requires extensive acknowledge signals.

Dynamic dataflow computers allow more than one token to be on the same arc simultaneously. Each token is associated with a tag. When tokens of identical tags are present on all the input arcs of an operator, it is fired.

(b) The root of A_i·x_i² + B_i·x_i + C_i = 0 can be computed as

    x_i = (-B_i + sqrt(B_i² - 4·A_i·C_i)) / (2·A_i).

The dataflow graph is shown in the following diagram for any i. There are 11 nodes, each with two input arcs and one output arc. The output tokens of the nodes are labeled a through i.
(c) The partition of the computations among the PEs is shown in the above diagram. The partition is not unique for achieving a balanced load among processors. Suppose each computation takes one clock cycle. Three of the PEs execute three computations each and the fourth one (PE2) executes two computations. The average latency of one iteration is 3 when the computation reaches the steady state. The schedule is shown in the following table, with the subscript of each output token corresponding to the loop index i.

(Table omitted: cycle-by-cycle assignment of the output tokens, subscripted by loop index, to PE0 through PE3.)
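The graph of part (b) can be read as a chain of two-input operations, one per node. The C sketch below shows one such decomposition of x = (-B + sqrt(B² - 4AC)) / (2A); the grouping of operations and their assignment to PEs in the actual solution follow the diagram above, so the breakdown here is only illustrative:

    #include <stdio.h>
    #include <math.h>

    /* Root of A*x^2 + B*x + C = 0 written as a sequence of two-input
       operations, each of which can be mapped onto one dataflow node. */
    double root(double A, double B, double C) {
        double a = B * B;       /* node: B*B            */
        double b = A * C;       /* node: A*C            */
        double c = 4.0 * b;     /* node: 4*(A*C)        */
        double d = a - c;       /* node: B^2 - 4AC      */
        double e = sqrt(d);     /* node: square root    */
        double f = 0.0 - B;     /* node: -B             */
        double g = f + e;       /* node: -B + sqrt(...) */
        double h = 2.0 * A;     /* node: 2A             */
        return g / h;           /* node: final divide   */
    }

    int main(void) {
        printf("root of x^2 - 3x + 2 = 0: %.3f\n", root(1, -3, 2));  /* 2 */
        return 0;
    }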
Problem 9.16

(a) For each y(i), m multiplications and m - 1 additions need to be performed, giving a total of mn multiplications and (m - 1)n additions.

(b) The computations of individual elements of y are independent of each other. Hence, the computations can be partitioned as follows: y(i) is computed by processor 0 for i = 0..63, by processor 1 for i = 64..127, by processor 2 for i = 128..191, and by processor 3 for i = 192..255.

(c) Using the above partition, each processor will need to have 67 elements of x and all four elements of w for the circular convolution. For instance, processor 0 needs x(0) through x(63) and x(253) through x(255). Note that the extra 3 elements to be fetched into each processor reside in memory modules 29, 30, and 31, respectively. This is a result of how the vector elements are stored and the nature of circular convolution. Therefore, proper interleaving is required to avoid conflicts. This interleaving is facilitated by the assumption of enough registers in each processor so that memory access and arithmetic operations can be performed in separate phases.

The ith elements of x and y are stored in memory module j = i mod 32. Elements of vector w are stored in a similar fashion in memory modules 0 through 3. With this storage scheme, each memory module stores 8 elements of vector x.
Modules 0 through 28 will be accessed 8 times each. Modules 29 through 31 will be accessed 12 times each due to the access contentions described above. To fetch w into all four processors takes 4 cycles. The access of 67 elements of x by each of the four processors takes another 67 cycles.

The computations of y can be carried out concurrently in all four processors, each responsible for 64 elements, resulting in a total of 4 x 64 + 3 x 64 = 448 cycles. Finally, the elements of y are stored back to memory at a rate of 4 elements per cycle, taking 64 cycles. Therefore, the total parallel execution time is

    t4 = 4 + 67 + 448 + 64 = 583 cycles.

(d) If a single processor is used, the following steps are required:
Fetch w from memory in 1 cycle.
Fetch x from memory in 64 cycles.
Compute y in 256 x (4 + 3) = 1792 cycles.
Store y into memory in 64 cycles.

Thus, the execution time on a single processor is

    t1 = 1 + 64 + 1792 + 64 = 1921 cycles.

The speedup using 4 processors over a single processor is

    t1/t4 = 1921/583 = 3.3.
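The partition of part (b) is easy to check against a direct computation. The C sketch below evaluates the circular convolution y(i) = sum of w(k)·x((i-k) mod 256) for k = 0..3, with the iterations grouped into four blocks of 64 to mirror the four-processor partition; it is a functional model only, not a timing model, and the data values are arbitrary:

    #include <stdio.h>

    #define N 256
    #define M 4

    int main(void) {
        double x[N], w[M], y[N];
        for (int i = 0; i < N; i++) x[i] = i;         /* sample data      */
        for (int k = 0; k < M; k++) w[k] = 1.0 / M;

        /* Four blocks of 64 elements, one per (simulated) processor.    */
        for (int p = 0; p < 4; p++) {
            for (int i = p * 64; i < (p + 1) * 64; i++) {
                double s = 0.0;                       /* 4 mults, 3 adds  */
                for (int k = 0; k < M; k++)
                    s += w[k] * x[(i - k + N) % N];
                y[i] = s;
            }
        }
        printf("y[0] = %.2f  y[255] = %.2f\n", y[0], y[255]);
        return 0;
    }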
Problem 9.17

(a) A fine-grain processor typically has a small amount of memory associated with it. In the construction of large-scale computer systems, fine-grain processors match better with fine-grain software parallelism and have a cost advantage over medium-grain processors.

(b) In a uniprocessor system, there is only a single address space. Many programs have been developed based on this concept. A single global address space offers continuity of this perception and can simplify the program development process, as the programmer does not need to worry about the message-passing mechanisms on individual machines. It also simplifies data partitioning and dynamic load balancing and improves the portability of programs across machines with different architectures.

(c) Because of high synchronization cost, coarse-grain parallelism necessitates the allocation of a large chunk of computations, such as several iterations, to each processor. As a result, low-level parallelism such as individual iterations or instructions is not fully exploited. From a scalability point of view, as the number of processors is increased, it is important to take advantage of such low-level parallelism in order to reduce solution time and improve processor utilization. This consideration favors the use of fine-grain parallelism over medium- or coarse-grain parallelism.

Chapter 10

Parallel Models, Languages and
Compilers
Problem 10.1

(a) In synchronous message-passing, the sender and the receiver must be synchronized in time and space. In other words, a communication channel must be established before message passing can commence, much like communication over a telephone line. No buffering is needed on the channel.

In the case of asynchronous message-passing, it is not necessary to coordinate the sender and receiver. A message is delivered to the channel and may be stored in buffers on the channel or in a global mailbox before arriving at the receiver. In this scheme, an acknowledgment from the receiver is needed to signal the correct receipt of a message.

(b) In synchronous message passing, if a channel cannot be established between two communicating processes, the message will be blocked, which in turn will block the execution of the processes involved. On the other hand, in asynchronous message passing, as long as the channel buffer is sufficiently large, the transmission of messages and the execution of processes will not be blocked. Therefore, it offers better resource utilization and potentially shorter communication delays.

(c) Rendezvous is a scheme adopted in Ada for synchronous message passing. In this scheme, a sender or receiver arriving earlier at the rendezvous has to wait for the arrival of the other before they can proceed to exchange messages.

(d) In a name-addressing scheme, a sender or receiver process is identified by the process ID and the node in which it resides. This convention is adopted by Ada. In a channel-addressing scheme, a path is established between a sender and a receiver process by specifying the channels connecting the nodes in which the processes reside.
(e) In asynchronous message passing, the sender and receiver processes are effectively uncoupled from each other via the use of intermediaries such as channel buffers or a global mailbox. Through this uncoupling, both processes can execute more freely, leaving the transmission of messages to be handled by the mechanisms provided by the communication channels.

(f) Both interrupt and lost messages can occur in asynchronous message-passing systems. An interrupt message differs from a regular message in that it has to be handled immediately by the receiver process, even though the receiver may not expect to receive it. After it is serviced, the interrupted process can resume its execution.

Lost messages are those directed to a wrong process or node and eventually lost. It is important to design effective detection and debugging facilities to redirect lost messages to the correct receivers to ensure smooth program execution.
Problem 10.2 The idea is to add fork-join primitives into the code to allow parallel execution of the program. Different concurrent Lisp languages, such as Multilisp, Qlisp, Symmetric Lisp, and Connection Machine Lisp, have different syntax and semantics. A Lisp-like code segment based on the concurrent object-oriented model can be found in [Agha90]. In practice, a Lisp language available on an accessible machine should be used to write a program to carry out the computations. Performance data can then be collected and analyzed.
Problem 10.3

(a) C* is a data parallel language developed by Thinking Machines. It provides high-level constructs for parallel computing on SIMD machines. Quinn and Hatcher described compiling and various optimization techniques to convert a program written in C* to one in C for execution on SPMD or MIMD machines. Four issues were addressed in their paper:
• how to infer message-passing requirements;
• how to support synchronization requirements;
• how to emulate a large number of PEs efficiently on a machine without hardware support for virtual processors;
• how to minimize message-passing cost.

Several methods to deal with these problems were discussed by the authors, including reduction of synchronization and message passing. For instance, in order to reduce message-passing cost, instructions and data can be replicated on all nodes. Also, data exchange can be carried out in blocks instead of bytes to reduce startup overhead. Their experiments with Gaussian elimination on an nCube 3200 showed that a C program generated from translation of C* code with message optimization was comparable in quality to a hand-coded C program.

(b) SIMD mode is synchronous in that all active PEs execute the same operations in a lockstep fashion. It is especially suitable for data parallel computations. In SPMD mode every PE executes the same program in an asynchronous manner. PEs coordinate with each other at synchronization points, but otherwise each PE works at its own pace between those points. Synchronization is achieved by message passing among processors. Asynchronous algorithms executed in SPMD mode are prone to time-dependent errors. In contrast, SIMD execution has simple flow control, and the computation results are deterministic regardless of the number of PEs. However, not all applications are suitable for execution in SIMD mode.

(c) See the optimization described in the paper for Gaussian elimination and conduct similar optimization for FFT after an analysis of the program flow.
Problem 10.4

(a) Multiprogramming refers to the interleaved execution of multiple independent programs on a uniprocessor or multiprocessor system through time sharing. Its use is intended to overlap CPU and I/O operations among programs to improve resource utilization.

(b) Multiprocessing is multiprogramming implemented at the process level on a multiprocessor. If interprocessor communications are handled at the instruction level, the mode of operation is MIMD multiprocessing, which exploits fine-grain parallelism.

(c) Multiprocessing in which interprocessor communication takes place at the program, procedural, or subroutine level is characterized as operating in MPMD mode. In this mode, coarse-grain parallelism is exploited.

(d) When a single program is divided into several interrelated tasks which can be executed concurrently on a multiprocessor, the mode of operation is referred to as multitasking.

(e) Multithreading is a refinement of the multitasking and multiprocessing concepts. A task can create multiple threads which are executed on one or more processors at the same time. Since threads are lightweight processes with minimum state and register information, context switching is much faster than in multiprogramming.

(f) Program partitioning refers to the decomposition of a large program and its data sets into small pieces which can be executed in parallel on multiple processors.
Problem 10.5

(a) 1. A(5,8,1), A(5,9,1), A(5,10,1), A(5,8,2), A(5,9,2), A(5,10,2), A(5,8,3), A(5,9,3), A(5,10,3), A(5,8,4), A(5,9,4), A(5,10,4), A(5,8,5), A(5,9,5), A(5,10,5).
2. B(3,5), B(3,6), B(3,7), B(3,8), B(6,5), B(6,6), B(6,7), B(6,8), B(9,5), B(9,6), B(9,7), B(9,8).
3. C(1,3,4), C(2,3,4), C(3,3,4).

(b) 1. Yes. The number of elements is the same in each dimension of the source and destination arrays.
2. No, because the two arrays have different sizes in the first dimension.
3. No, because the two arrays have different dimensions.
4. Yes.
Problem 10.6

(a) Flow dependence between statements S1 and S2 in successive iterations of the loop. The distance vector is (0,1), and the direction vector is (=, <).

(b) Flow dependence between statements S1 and S2. The distance vector is (0,0), and the direction vector is (=, =).

(c) Antidependence between statements S1 and S3 in successive i-loop iterations. The distance vector is (-1,0), and the direction vector is (>, =).
Problem 10.7

(a) S1 → S2 (flow dependence on A in the same iteration); S3 → S2 (antidependence on C carried across iterations).

(b) The vectorized code is as follows:

    A(1:N) = B(1:N)
    E(1:N) = C(2:N+1)
    C(1:N) = A(1:N) + B(1:N)

Note that it is necessary to store the original value of C in E before C is overwritten. Therefore, the order of statements S2 and S3 in the original loop is reversed in the vector code. It is also permissible to interchange the first two vector statements, since they are independent.
Problem 10.8

(a) 1. There is a flow dependence on variable A between statements S1 and S3 in successive iterations of the J-loop. The distance vector is (0,1), and the direction vector is (=, <).
2. There is a flow dependence on variable E between statements S2 and S4 in successive iterations of the J-loop. The distance vector is (0,1), and the direction vector is (=, <).
3. There is an antidependence on variable C between statements S1 and S2 in the same iteration. The distance vector is (0,0), and the direction vector is (=, =).
(b) There is no data dependence among different I-loop iterations. Therefore, they can be executed in parallel. The compiler can preschedule the iterations of the I-loop onto P processors in contiguous blocks as follows:
processor 1 executes iterations 1, 2, ..., [N/P];
processor 2 executes iterations [N/P] + 1, [N/P] + 2, ..., 2[N/P];
...
Alternatively, every Pth iteration can be assigned to the same processor:
processor 1 executes iterations 1, P + 1, 2P + 1, ...;
processor 2 executes iterations 2, P + 2, 2P + 2, ...;
...
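The two prescheduling policies correspond to simple index mappings. The C sketch below prints which iterations each of P processors would execute under block and cyclic assignment; the iteration count and processor count are arbitrary examples:

    #include <stdio.h>

    #define NITER 10
    #define P     3

    int main(void) {
        int chunk = (NITER + P - 1) / P;      /* ceiling of N/P             */

        /* Block (contiguous) assignment.                                  */
        for (int p = 0; p < P; p++) {
            printf("block  processor %d:", p + 1);
            for (int i = p * chunk + 1; i <= (p + 1) * chunk && i <= NITER; i++)
                printf(" %d", i);
            printf("\n");
        }
        /* Cyclic assignment: every Pth iteration to the same processor.  */
        for (int p = 0; p < P; p++) {
            printf("cyclic processor %d:", p + 1);
            for (int i = p + 1; i <= NITER; i += P)
                printf(" %d", i);
            printf("\n");
        }
        return 0;
    }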
Problem 10.9

(a) The loop can be compiled with the I-loop in vector mode, which will generate stride-1 memory operations.

(b) The loop can be compiled for parallelization in the J-loop as follows:

    Doacross J = 1, N
        S1:  A(1:N, J+1) = B(1:N, J) + C(1:N, J)
             signal(J)
             if (J > 1) wait(J-1)
        S2:  D(1:N, J) = A(1:N, J) / 2
    Endacross
The parallel execution is illustrated in the following diagram for the case of two processors.

(Diagram omitted: processor 1 executes the odd iterations and processor 2 the even iterations; in each iteration, S2 is delayed by the wait until the signal from S1 of the preceding iteration has been received.)
Loop permutation maps the iteration vector (p1, ..., pi, pi+1, ..., pn) to (p1, ..., pi+1, pi, ..., pn); loop reversal maps (p1, ..., pi, ..., pn) to (p1, ..., -pi, ..., pn).
The above three transformations can be formulated as elementary matrix operations. See the text for matrix representations and examples.

(d) Loop tiling refers to various techniques of breaking iterations into small blocks to obtain coarser granularity, which can reduce synchronization overhead and improve data locality. Typically an n-deep loop is converted into a 2n-deep loop, where the inner n loops are determined by the tile size used.
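As a concrete instance of the 2n-deep structure, the C sketch below tiles a simple 2-deep loop nest with tile size T; the array and the loop body are placeholders chosen only to make the example self-contained:

    #include <stdio.h>

    #define N 64
    #define T 16                               /* tile size */

    double a[N][N];

    int main(void) {
        /* Original 2-deep nest:
             for (i = 0; i < N; i++)
               for (j = 0; j < N; j++)
                 a[i][j] = i + j;
           Tiled 4-deep nest: the two outer loops step from tile to tile,
           and the two inner loops stay within one T x T tile, improving
           data locality and giving a natural unit for synchronization.  */
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int j = jj; j < jj + T; j++)
                        a[i][j] = i + j;

        printf("a[%d][%d] = %.0f\n", N - 1, N - 1, a[N - 1][N - 1]);
        return 0;
    }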
(e) Wavefront transformation is a technique to maximize the degree of parallelism in n fully permutable loops with dependences. The idea is to skew the innermost loop in the nest with respect to each of the other loops and then move the innermost loop to the outermost position. See the text for examples.

(f) Locality optimization is used to reduce memory access penalties. The idea is to improve the reuse of data once it is brought into a level of the memory hierarchy which is closer to the processor. Techniques such as loop interchange, instruction and data prefetching, and tiling can be used to achieve this goal.

(g) Software pipelining is the pipelining of successive iterations of a loop in a source program. It is particularly suited to deep hardware pipelines and can be used with either Doall or Doacross loops. As in hardware pipelining, it is desirable to minimize the instruction initiation latency.
Problem 10.11

(a) In iteration I, A(I) is updated by the value of A(I+1). The value of A(I+1) is not updated until the (I+1)st iteration, which has not been executed yet. In general, with a forward LCD, the reference to an element always occurs before its value is updated in a later iteration. This type of operation can be vectorized. In effect, the computations in the loop add a scalar constant 3.14159 to each element of A and then shift the elements forward by one position. In other words, the loop is equivalent to the following vector code:

    V(1:N) = A(2:N+1) + 3.14159
    A(1:N) = V(1:N)

(b) The assignment to A(2) in the second iteration depends on the value assigned to B(2) in the first iteration. The compiler can interchange the statements within the loop so that the assignment to B occurs before the assignment to A, as shown below:

    Do I = 1, N-1
        B(I+1) = D(I) + 3.14159
        A(I) = B(I) + C(I)
    Enddo

The code can then be vectorized as follows:

    B(2:N) = D(1:N-1) + 3.14159
    A(1:N-1) = B(1:N-1) + C(1:N-1)
Problem 10.12

(a) This program can be vectorized as follows:

    A(1:N) = TEMP(1:N) + 3.14159
(b) The code cannot be directly vectorized or parallelized because of the carry-around variables S and X. To see this, consider the following parallel code:

    Doall I = 1, N
        If (A(I) .LE. 0.0) then
            S = S + B(I) * C(I)
            X = B(I)
        Endif
    Enddo

If all processors are allowed to proceed concurrently, the values of S and X will be nondeterministic. In contrast, the serial code gives a definite answer for S and X.

However, if an intermediate vector is introduced to store the values of B(I) * C(I), then some vector or parallel processing can be achieved. This is illustrated in the following code for performing the conditional inner-product operation:

    D(1:N) = 0
    where (A(1:N) .LE. 0.0) do
        D(1:N) = B(1:N) * C(1:N)
    endwhere

See [Wolfe89] for more details. The elements of D can then be summed up in parallel using a binary tree computing structure to obtain S. Alternatively, S can be obtained by a vector reduction operation as follows:

    S = sum(D(1:N))

Similarly, the determination of X in the original loop can be vectorized. Let vector P be initialized so that P(I) = I for I = 1..N, and let Q be a zero vector. The following vector code yields the desired result:

    where (A(1:N) .LE. 0.0) do
        Q(1:N) = P(1:N)
    endwhere
    K = max(Q(1:N))
    X = B(K)

In the above, max is a vector reduction function which finds the maximum value of a vector. Of course, the performance of the vector code will depend on how fast vectors P and Q can be generated. Typically, P and the initial Q are created at compile time since their elements are fixed. Then the cost can be amortized over a large number of executions of the vector code.
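The effect of the where/sum/max construction can be mirrored in plain C for checking. The sketch below builds the masked vectors D and Q and then applies the two reductions; the array contents are made-up test data:

    #include <stdio.h>

    #define N 8

    int main(void) {
        double A[N] = {-1, 2, -3, 4, -5, 6, -7, 8};
        double B[N] = { 1, 1,  2, 1,  3, 1,  4, 1};
        double C[N] = { 1, 1,  1, 1,  1, 1,  1, 1};
        double D[N];
        int    Q[N];

        /* Masked products and masked indices (the two where-blocks).   */
        for (int i = 0; i < N; i++) {
            D[i] = (A[i] <= 0.0) ? B[i] * C[i] : 0.0;
            Q[i] = (A[i] <= 0.0) ? i + 1       : 0;   /* 1-based index  */
        }
        /* Reductions: S = sum(D), K = max(Q), X = B(K).                */
        double S = 0.0;
        int    K = 0;
        for (int i = 0; i < N; i++) { S += D[i]; if (Q[i] > K) K = Q[i]; }
        double X = B[K - 1];

        printf("S = %.1f  X = %.1f\n", S, X);   /* expect S = 10, X = 4 */
        return 0;
    }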
Problem 10.13 Tanenbaum et al. proposed a layered approach to provide a uniform interface for parallel programming. The approach is insensitive to machine architecture and can be used with multiprocessors or multicomputers. Besides architecture transparency, the other goal is to maintain good performance in a distributed shared memory system. The two major components of the system are shared objects and reliable broadcasting. An object is an abstract data type with well-defined operations. For instance, an object can be a data structure with read and write operations.

An object that is shared by multiple processes is replicated for each process that needs to access it. When a process performs a read operation on a shared object, it is treated as an operation on a private object and can be done locally with proper synchronization. When a write operation is performed on a shared object, the updated value needs to be sent to other processes via the reliable broadcasting mechanism.

In general, read operations occur much more frequently than write operations. Therefore, replicating and sharing data objects can be effective. Moreover, the low overhead associated with reliable broadcasting (at most 2 sends for each message) allows the system to scale up in performance. Consult [Tanenbaum92] for more details about the broadcasting protocols and object management schemes.

Chapter 11
Parallel Program Development
and Environments
Problem 11.1

(a) In busy wait, a process waiting for an event remains loaded in the context registers of a processor and keeps trying to get into the critical section. In sleep wait, a waiting process is removed from the processor and put in a wait queue. Later on, after the event it is waiting for takes place, the suspended process is awakened and rescheduled.

(b) In sleep wait, a policy is needed to select one of the suspended processes in the wait queue to be revived. The policy must ensure that all suspended processes in the queue are treated fairly. That is, no process should be suspended for an extraordinary amount of time compared to others. For instance, a first-come-first-served revival policy is a fair policy.

(c) A lock is a mechanism used to implement presynchronization, in which a requester process is required to obtain sole access to an atom (a shared writable object) before performing an operation to update it. The purpose is to avoid concurrent updates to an object.

(d) Optimistic synchronization (or postsynchronization) allows an atom to be updated before sole access is granted to a requester process. This is achieved in two steps. First, the requester modifies a local version of the object. Second, it checks to see if there has been a concurrent update to the global version. If so, the local update is aborted; otherwise, the global version is updated.

(e) In server synchronization, each atom is associated with an update server. Any process that wishes to perform an atomic operation on an atom has to do so by sending a request to the server. This approach is often adopted in object-oriented systems to provide data encapsulation. The corresponding synchronization environment is often more user friendly, as the user does not need to know or worry about the implementation details of the mutual exclusion mechanisms. This strategy is adopted in monitors for synchronization and can be implemented with server daemons.
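Optimistic synchronization as described in (d) is commonly realized with a compare-and-swap retry loop. A minimal C11 sketch follows; the shared counter and the update applied to it are illustrative stand-ins for the local and global versions of an atom:

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_long shared = 0;              /* global version of the atom     */

    /* Compute an updated local value, then commit it only if no
       concurrent update has occurred; otherwise abort and retry.         */
    void optimistic_add(long delta) {
        long old, new;
        do {
            old = atomic_load(&shared);  /* snapshot the global version    */
            new = old + delta;           /* update the local version       */
        } while (!atomic_compare_exchange_weak(&shared, &old, new));
    }

    int main(void) {
        optimistic_add(5);
        optimistic_add(7);
        printf("shared = %ld\n", atomic_load(&shared));   /* prints 12 */
        return 0;
    }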
Problem 11.2

(a) A lock is a mechanism used to ensure sole access to a critical section. If a spin lock is used, a process waiting to enter the critical section will keep on trying until it gains access. In the case of a suspend lock, once a process is denied access to the critical section, it is suspended and put in a queue. Suspended processes are activated one by one when access to the critical section is allowed. A suspend lock allows more efficient use of the processor than a spin lock, but care must be taken to guard against indefinite waiting by some processes.
(b) Dekker's algorithm for synchronization ensures mutual exclusion and avoids unnecessary waiting. To accomplish this, each process uses a flag to indicate whether it desires to enter the critical section. To achieve mutual exclusion, each process checks whether there is another process in the critical section; if so, it backs off. The following algorithm is described in [Silberschatz88]. It uses an array flag(0 : n-1) to indicate the status of the processes. Each element of the array can assume three values: idle, in, and out. A global variable turn is used to select a process between 0 and n - 1. Initially, all the elements of the flag array are set to idle, and turn can assume any valid value. An auxiliary integer variable j is also used in the algorithm. Each process i, 0 <= i < n, announces its intention by setting flag(i), waits until it is selected by turn, and on leaving the critical section passes the turn to the next waiting process:

    critical section
    flag(i) = idle;
    j = (i + 1) mod n;
    while (j != i && flag(j) != in) j = (j + 1) mod n;
    turn = j;
    exit critical section

Note that initially it is possible for several processes to set their flags to in at the same time. If that happens, all of these processes will be forced to reset their flags to out. On the second try, only one of them will be able to enter and set its flag to in; the others will be blocked and will spin wait. When an incumbent process exits the critical section, it selects the next process to enter the critical section in an orderly manner. This guarantees that any process wishing to enter the critical section will be able to do so after at most n - 2 tries.
(c) The generalized Dekker's algorithm can be implemented using Test&Set. Each process is associated with a flag, which can be examined and/or changed by all the processes. In addition, each process has a local variable key, which can only be updated by the owning process. A global variable lock is used to guard the entrance to a critical section. Initially, all the flags are set to false. Each process i wishing to enter the critical section executes the following code, also adapted from [Silberschatz88]:

    flag(i) = true;
    key = true;
    while (flag(i) && key) key = Test&Set(lock);
    flag(i) = false;

    critical section

    j = (i + 1) mod n;
    while (j != i && flag(j) == false) j = (j + 1) mod n;
    if (j == i) lock = false;
    else flag(j) = false;

This code uses the atomic Test&Set operation to ensure mutual exclusion of the critical section. The method used to select the next process is similar to the software algorithm in (b).
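For comparison, the same bounded-waiting lock can be written with C11 atomics, where the Test&Set role is played by atomic_exchange. This is a sketch under the assumption of NPROC cooperating threads, each knowing its own index i; it mirrors the pseudocode above rather than any particular library interface:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NPROC 8

    static atomic_bool lock;                    /* guards the critical section */
    static atomic_bool waiting[NPROC];          /* flag(i) of the pseudocode   */

    void enter_cs(int i) {
        atomic_store(&waiting[i], true);
        bool key = true;
        while (atomic_load(&waiting[i]) && key)
            key = atomic_exchange(&lock, true); /* Test&Set                    */
        atomic_store(&waiting[i], false);
    }

    void exit_cs(int i) {
        int j = (i + 1) % NPROC;                /* look for a waiting process  */
        while (j != i && !atomic_load(&waiting[j]))
            j = (j + 1) % NPROC;
        if (j == i)
            atomic_store(&lock, false);         /* nobody waiting: free lock   */
        else
            atomic_store(&waiting[j], false);   /* hand the lock over directly */
    }

    int main(void) {                            /* trivial single-thread check */
        enter_cs(0);
        printf("in critical section\n");
        exit_cs(0);
        return 0;
    }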
Problem 11.3

(a) A binary semaphore is a variable which can assume the value 1 or 0. It has two associated operations, P and V, corresponding to wait and signal. A process wishing to enter a critical section first performs a P operation to see if another process is already in the critical section. If that is the case, it is blocked. When a process leaves the critical section, it performs a V operation, thus allowing a waiting process to be awakened and to enter the critical section. A binary semaphore is initialized to 1 to allow the first process to enter the critical section without waiting. It can be implemented in hardware using atomic operations such as Test&Set.

(b) A monitor is a high-level construct that encapsulates shared variables and associated procedures into a module. A monitor consists of (1) local variables, (2) procedures that manipulate local variables, global variables, and parameters passed from calling processes, and (3) initialization of local variables. Only the values of local variables can be changed by the procedures. Also, only one process is allowed to be in the monitor at a time; thus the mutual exclusion mechanism is embedded in the construct. A monitor relieves individual processes of the need to take care of mutual exclusion in the code and reduces the possibility of errors. For instance, in the use of a binary semaphore, if the semaphore is not initialized to 1, processes that wish to enter the critical section will hang up indefinitely. With the use of monitors, the debugging process is simplified by getting rid of such inadvertent mistakes.
Problem 11.4 Let the philosophers and forks both be numbered 0 to 4. The fork to the right of philosopher i is fork i and the one to his left is fork (i - 1) mod 5.

(a) Let forks(0:4) be the semaphores associated with the forks; all its elements are initialized to 1 at the beginning.

In the fetch protocol, an even-numbered philosopher first picks up the fork to his right and then the one to his left. An odd-numbered philosopher first picks up the fork to his left and then the one to his right. In the release protocol, both forks are put down in any order.

Fetch protocol

    if (i mod 2 == 0) then
        P(forks(i));
        P(forks((i-1) mod 5));
    else
        P(forks((i-1) mod 5));
        P(forks(i));
    endif

Release protocol

    V(forks((i-1) mod 5));
    V(forks(i));

This protocol allows a philosopher to hold one fork while waiting for the other. Deadlocks are avoided by breaking the circular wait among the philosophers, which is a necessary condition for deadlock to occur. Based on the protocol, at least one philosopher will be able to eat at any moment. Moreover, a philosopher will pick up the first fork as soon as it becomes available instead of waiting until both forks on his sides are available. This prevents a conspiracy between two philosophers to starve a third philosopher seated between them. Therefore, starvation is also avoided.
(b) The above fetch and release protocols can be implemented using a monitor as follows:

    Monitor dining-philosophers
        forks(0:4): condition;
        procedure fetch(i)
        begin
            if (i mod 2 == 0) then
                wait(forks(i mod 5));
                wait(forks((i-1) mod 5));
            else
                wait(forks((i-1) mod 5));
                wait(forks(i mod 5));
            endif
        end
        procedure release(i)
        begin
            signal(forks((i-1) mod 5));
            signal(forks(i mod 5));
        end
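The same asymmetric fetch order can also be expressed directly with Pthreads mutexes standing in for the fork semaphores. The sketch below is illustrative only: the eating and thinking phases are reduced to a print statement and each philosopher runs a fixed number of rounds:

    #include <pthread.h>
    #include <stdio.h>

    #define NPHIL  5
    #define ROUNDS 3

    static pthread_mutex_t forks[NPHIL];

    static void *philosopher(void *arg) {
        int i     = (int)(long)arg;
        int right = i;                        /* fork i is on the right       */
        int left  = (i + NPHIL - 1) % NPHIL;  /* fork (i-1) mod 5 on the left */

        for (int r = 0; r < ROUNDS; r++) {
            if (i % 2 == 0) {                 /* even: right fork first       */
                pthread_mutex_lock(&forks[right]);
                pthread_mutex_lock(&forks[left]);
            } else {                          /* odd: left fork first         */
                pthread_mutex_lock(&forks[left]);
                pthread_mutex_lock(&forks[right]);
            }
            printf("philosopher %d eats (round %d)\n", i, r);
            pthread_mutex_unlock(&forks[left]);
            pthread_mutex_unlock(&forks[right]);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NPHIL];
        for (int i = 0; i < NPHIL; i++) pthread_mutex_init(&forks[i], NULL);
        for (int i = 0; i < NPHIL; i++)
            pthread_create(&t[i], NULL, philosopher, (void *)(long)i);
        for (int i = 0; i < NPHIL; i++) pthread_join(t[i], NULL);
        return 0;
    }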
Problem 11.5 A set of processes is in a state of deadlock when every process in the set is waiting for resources held by another process in the set. According to the definition, we know that four conditions — hold and wait, no preemption, mutual exclusion, and circular waiting — must hold at the same time to cause a deadlock. If any of the conditions is false, then deadlock can be prevented. For example, if a resource is sharable by more than one process or can be preempted, then there is no need to wait for the resource. Circular waiting is implied in the definition of deadlock. Finally, if a process does not hold resources while waiting for others, these resources can be used by other processes, thereby breaking the stalemate situation.

When all four conditions hold simultaneously, a deadlock situation will potentially occur. But a deadlock can often be averted by properly revising the resource allocation diagram to eliminate circular waiting.

Deadlock prevention refers to the use of suitable protocols to ensure that at least one of the four necessary conditions for deadlock will not hold, and thus the occurrence of deadlock is prevented.

Deadlock avoidance refers to the management of resources so that situations that may lead to deadlock are avoided. Usually it is achieved by dynamically keeping track of the resources available, allocated, and requested. The operating system closely monitors the usage of resources to avoid deadlocks.

Deadlock detection is a systematic approach for detecting whether a deadlock situation is present. When no deadlock prevention or avoidance measure is employed, deadlocks may occur and need to be detected so that a deadlock recovery algorithm can be invoked.

When a deadlock is detected, a deadlock recovery strategy is used to break it. Two options are often adopted. One is to kill one or more of the deadlocked processes to remove the circular waiting. The other is to preempt some of the resources held by one or more of the deadlocked processes.
Problem 11.6

(a)
• B and D do not cause deadlock, because only after B releases S1 can D claim S1. If A is executed before B, there will be no deadlock between A and D, either. But if B is executed before A, then A and D can enter a deadlock, with D holding S2 and S3 while waiting for S1, and A holding S1 while waiting for S2.
• C and E can be in deadlock. After C gets S2 (P(S2)) and E gets S3 (P(S3)), C claims S3 and E claims S2, which can never be satisfied.
(b) If C and E are deadlocked, A, B, and D will be blocked indefinitely. If A and D
are deadlocked, C and E will be blocked indefinitely.
(c) It depends on race conditions. For instance, if C (or E) can secure both S2 and
S3 before E (or C), it will have all the resources it needs. After it finishes execution,
both resources are released so that E (or C) can proceed. Thus, deadlock is avoided.
Similarly, whether A and D deadlock also depends on a race condition.
(d) The deadlock between C and E can be prevented by either of the following two
changes, which alter the order of resource acquisition in C or E:

    in C: P(S3); P(S2); ...    or
    in E: P(S2); P(S3); ...

The resulting resource allocation graphs contain no circular wait between C and E.
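The second option can be made concrete with a small C fragment in which both C and E acquire the semaphores in the same order. POSIX semaphores stand in for the P and V operations, and the function bodies are placeholders rather than the processes defined in the problem.

    /* Sketch of the second option: both C and E acquire S2 before S3, so no
     * circular wait can form.  The semaphores are assumed initialized to 1
     * elsewhere. */
    #include <semaphore.h>

    sem_t S2, S3;

    void process_C(void)
    {
        sem_wait(&S2);            /* P(S2) */
        sem_wait(&S3);            /* P(S3) */
        /* ... use both resources ... */
        sem_post(&S3);            /* V(S3) */
        sem_post(&S2);            /* V(S2) */
    }

    void process_E(void)
    {
        sem_wait(&S2);            /* P(S2): same order as C, so whichever   */
        sem_wait(&S3);            /* process obtains S2 first can always finish */
        /* ... use both resources ... */
        sem_post(&S3);
        sem_post(&S2);
    }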
Problem 11.7
(a) Suppose on the disk there are n cylinders numbered 0 through n - 1, starting from
the innermost one. An "elevator" algorithm is used in the scheduling. The idea
is to continue sweeping in the inbound or outbound direction until all requests in that
direction have been serviced. Then the sweeping direction is reversed. For details,
see [Bic88].
When a request is made and the disk head is busy, the request is put in
one of two queues: one (insweep) corresponds to inward movement and the other
(outsweep) to outward movement of the disk head. The queued requests are served
according to the position of the destination cylinder.
The scheduler can be implemented by a monitor with conditional wait. See
[Silberschatz88]. If a request is put in the outsweep queue, the distance between
the destination and innermost cylinders (dest) is stored with the request. If a
request is put in the insweep queue, the distance between the destination and
outermost cylinders (n - dest) is stored. The requests are then serviced in the
order determined by this number: the smaller the number, the earlier a request is
serviced. Clearly, the motivation for the policy is to reduce the movement of the
disk head. The following monitor implementation is adapted from [Bic88].
Monitor disk-scheduler
    type direction = (in, out);
    dest, pos: integer;
    dir: direction;
    busy: boolean;
    incount, outcount: integer;
    insweep, outsweep: condition;
    procedure request(dest);
    begin
        if busy
            if (pos < dest) || (pos == dest && dir == out)
                outcount = outcount + 1;
                outsweep.wait(dest);
            else
                incount = incount + 1;
                insweep.wait(n - dest);
            endif
        endif
        busy = true;
        pos = dest;
    end
    procedure release;
    begin
        busy = false;
        if dir == out
            if outcount > 0
                outcount = outcount - 1;
                outsweep.signal;
            else
                dir = in;
                if incount > 0
                    incount = incount - 1;
                    insweep.signal;
                endif
            endif
        else
            if incount > 0
                incount = incount - 1;
                insweep.signal;
            else
                dir = out;
                if outcount > 0
                    outcount = outcount - 1;
                    outsweep.signal;
                endif
            endif
        endif
    end
    begin
        dir = in; pos = n - 1; busy = false;
        incount = 0; outcount = 0;
    end
In the above program, the syntax of the wait instruction is slightly changed to
accommodate the priority parameter.
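The decision made at the top of request, and the priority key under which a request waits, can be restated as a small C helper. This is an illustrative restatement only; the names sweep_t and classify_request do not appear in [Bic88] or in the monitor above.

    /* Illustrative C helper mirroring the queueing decision in request():
     * choose the sweep queue and the priority key under which the request
     * waits.  Smaller keys are serviced earlier. */
    typedef enum { IN, OUT } sweep_t;

    typedef struct {
        sweep_t queue;   /* which condition queue the request joins      */
        int     key;     /* priority: smaller key means serviced earlier */
    } enqueue_choice;

    /* n    : number of cylinders (0 .. n-1, innermost first)
     * pos  : current head position;  dir: current sweep direction
     * dest : requested cylinder                                          */
    enqueue_choice classify_request(int n, int pos, sweep_t dir, int dest)
    {
        enqueue_choice c;
        if (pos < dest || (pos == dest && dir == OUT)) {
            c.queue = OUT;
            c.key   = dest;      /* outward sweep serves increasing cylinders */
        } else {
            c.queue = IN;
            c.key   = n - dest;  /* inward sweep serves decreasing cylinders  */
        }
        return c;
    }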
(b) A user process can access data on the disk by the following sequence:

    request(cylnum);
    call driver procedure to transfer the data;
    release;

The cylnum argument indicates the location on the disk where the requested data
resides. It can be generated by the file server from user-specified information.
Problem 11.8 A monitor for a barrier counter can be specified as follows:

Monitor barrier-counter
    var counter: integer;
    flag: condition;
    procedure block(n)
    begin
        counter = counter + 1;
        if (counter == n) then
            begin
                for (i = 1; i < n; i++) flag.signal;
                counter = 0;
            end
        else
            flag.wait;
        endif
    end
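For comparison, the same barrier behaviour can be sketched in C with a Pthreads mutex and condition variable. The sketch is illustrative; the names barrier_t, barrier_init, and barrier_block are chosen here, and POSIX itself also offers pthread_barrier_t for this purpose.

    /* Illustrative C sketch of the barrier monitor above. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  flag;
        int counter;     /* processes that have arrived in this round     */
        int n;           /* number of processes in the barrier            */
        int round;       /* distinguishes successive uses of the barrier  */
    } barrier_t;

    void barrier_init(barrier_t *b, int n)
    {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->flag, NULL);
        b->counter = 0;
        b->n = n;
        b->round = 0;
    }

    void barrier_block(barrier_t *b)
    {
        pthread_mutex_lock(&b->lock);
        int my_round = b->round;
        if (++b->counter == b->n) {          /* last arrival: release everyone */
            b->counter = 0;
            b->round++;
            pthread_cond_broadcast(&b->flag);
        } else {
            while (b->round == my_round)     /* guard against spurious wakeups */
                pthread_cond_wait(&b->flag, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }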
(b) POSIX (Portable Operating System Interface for UNIX) is an attempt to standardize
operating systems so that applications conforming to the POSIX standard
are portable from one platform to another. IEEE began defining POSIX standards
in 1985. With FIPS 151-1, POSIX was declared by the National Institute of Standards and
Technology to be the standard interface for government open systems in 1990.
Many vendors have subsequently come up with operating systems that comply
with POSIX. OSF/1 compliance with POSIX includes shells, real-time computing,
security facilities, transparent file-access support, protocol-independent interprocess
communication, etc.
(c) The program development environment contains a set of tools, including editors,
compilers, linker, and debugger, based on packages developed by the Free Software
Foundation. Major UNIX shells are also supported. The OSF/1 environment supports
application program construction through a layered approach, with applications
on top of user libraries and system libraries, which in turn are supported by the OS
kernel. Shared libraries reduce space requirements, improve performance, and
lower development and debugging cost. Separate compilation and dynamic linking
help modular development of application programs. Position-independent code
placement also improves performance, among other benefits.
Problem 12.12
(a) A Pthread is a thread as defined in the POSIX standard. Each thread has a single
sequential line of control and is intended to carry out a small, self-contained job.
Motivations for the use of Pthreads are enumerated below:

  • Use of Pthreads enables cross-development of multiprocessor programs on
    a uniprocessor system or on different platforms.
  • A server task can spawn several threads to serve multiple requests. Doing
    so improves resource utilization with light overhead. While one thread is
    blocked, others can be running. On a system with multiple processors,
    the requests can be serviced concurrently.
  • Independent threads can be executing in different states. Multiple threads
    allow computation, communication, and I/O activities to be overlapped.
  • Multiple threads allow asynchronous events to be handled more efficiently
    by preventing inconvenient interrupts and avoiding complex flow control.
(b) The database may be shared among several programs. A user program wishing to
retrieve or update the database sends a request to the server through a communication
channel. The server then spawns a thread to serve the request. Since there
might be several threads trying to access the database, proper synchronization is
needed to prevent simultaneous updates to the data. This is provided by a global
lock db_mutex which ensures that operations on the database are performed in a
critical section. The lock can be envisioned as a semaphore, initialized to 1 at the
beginning. Then pthread_mutex_lock and pthread_mutex_unlock can be viewed as
P and V operations, respectively. See Chapter 11 for more details.
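A minimal sketch of this structure in C is given below. The helper names handle_request, update_database, and serve_forever are illustrative placeholders; only the use of a global db_mutex with pthread_mutex_lock/pthread_mutex_unlock follows the discussion above.

    /* Sketch of the server structure in (b): each incoming request is handled
     * by a new Pthread, and db_mutex serializes access to the shared database. */
    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t db_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void update_database(const int *req) { (void)req; /* ... */ }

    static void *handle_request(void *arg)
    {
        int *req = (int *)arg;

        pthread_mutex_lock(&db_mutex);      /* P(db_mutex): enter critical section */
        update_database(req);
        pthread_mutex_unlock(&db_mutex);    /* V(db_mutex): leave critical section */

        free(req);
        return NULL;
    }

    /* Server loop: spawn one detached thread per request. */
    void serve_forever(int (*next_request)(void))
    {
        for (;;) {
            int *req = malloc(sizeof *req);
            *req = next_request();          /* e.g., read from a communication channel */
            pthread_t tid;
            pthread_create(&tid, NULL, handle_request, req);
            pthread_detach(tid);
        }
    }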
Problem 12.13

(a) LINPACK is a package developed by Jack Dongarra, Jim Bunch, Cleve Moler,
and Pete Stewart for solving linear equations and linear least squares problems.
See [Dongarra79] for a more detailed description of the package and its usage.
LINPACK has been widely used as a benchmark to determine the performance of
various computer systems. See [Dongarra92].
It can deal with linear systems whose matrices are general, banded, symmetric
indefinite, symmetric positive definite, triangular, or tridiagonal square. It uses
Gaussian elimination with pivoting, and Cholesky factorization for symmetric
positive definite matrices, to decompose a matrix. In addition, the package computes
the QR (by Householder transformations) and singular value decompositions of
rectangular matrices and applies them to least squares problems. For a description of the
pertinent algorithms, please consult texts on numerical analysis or matrix algebra
such as [Dahlquist74] and [Golub89].
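To illustrate the kind of factorization LINPACK performs for a general matrix, a compact (and deliberately unoptimized) C sketch of Gaussian elimination with partial pivoting is given below. It is not LINPACK code; the function name lu_factor and the row-major, in-place storage convention are assumptions made for this example.

    /* Gaussian elimination with partial pivoting on an n x n matrix stored
     * row-major in a[].  Overwrites a[] with the L and U factors and records
     * the row interchanges in piv[].  Returns 0 on success, or k+1 if a zero
     * pivot is met at step k. */
    #include <math.h>

    int lu_factor(int n, double *a, int *piv)
    {
        for (int k = 0; k < n; k++) {
            /* choose the largest remaining entry in column k as the pivot */
            int p = k;
            for (int i = k + 1; i < n; i++)
                if (fabs(a[i * n + k]) > fabs(a[p * n + k])) p = i;
            piv[k] = p;
            if (a[p * n + k] == 0.0) return k + 1;   /* singular to working precision */

            if (p != k)                               /* swap rows k and p */
                for (int j = 0; j < n; j++) {
                    double t = a[k * n + j];
                    a[k * n + j] = a[p * n + j];
                    a[p * n + j] = t;
                }

            /* eliminate below the pivot; multipliers are stored in place */
            for (int i = k + 1; i < n; i++) {
                double m = a[i * n + k] / a[k * n + k];
                a[i * n + k] = m;
                for (int j = k + 1; j < n; j++)
                    a[i * n + j] -= m * a[k * n + j];
            }
        }
        return 0;
    }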
(b) Most machines provide vectorization and/or concurrentization support based on
extensive dependence analysis. Other optimization techniques such as loop interchange
may also be implemented. They also allow user interaction to optionally
enable or disable such optimizations. Please check the manuals of machines accessible
locally.
(c) Parallel I/O is important to performance when the data set is large. Without
efficient support, I/O can become a bottleneck and degrade overall performance.
Parallel I/O is also desirable to support real-time monitoring of program activities
for performance tuning or debugging purposes. The OS should also support effective
program partitioning, scheduling, and synchronization for the parallel execution of
LINPACK programs.
Problem 12.14 The degree of compiler support provided by different machines may
vary widely. For instance, some systems use very primitive processors which may not be
able to perform vector operations, while other systems may have sophisticated processors
capable of efficient vector processing. Concurrency support is typically provided through
a library of system calls, which manages message passing and other activities. System
calls can be linked with user programs at compilation/linkage time. Parallel I/O support
is essential so that code and data can be quickly distributed to individual nodes and
the results be sent back to the host. Dynamic load balancing provided by the OS will
be valuable to the efficient utilization of system resources, especially when the matrix
is not regularly structured. Check relevant manuals for more detailed information.
Problem 12.15

(a) If the conservative policy is used, at most 20/4 = 5 processes can be active simultaneously.
Since one of the drives allocated to each process can be idle most of the
time, at most 5 drives will be idle at a time. In the best case, none of the drives
will be idle.
(b) To improve drive utilization, each process can be allocated three tape drives.
The fourth one will be allocated on demand. Under this policy, at most ⌊20/3⌋ = 6
processes can be active simultaneously. The minimum number of idle drives is 0
and the maximum is 2.

Bibliography
[Adam74] T. L. Adam, K. M. Chandy, and J. R. Dickson, "A Comparison of List
Schedules for Parallel Processing Systems", Commun. ACM, 17(12):685-690, 1974.
[Agha90] G. Agha, “Concurrent Object-Oriented Programming”, Commun. ACM,
33(9):125-141, Sept. 1990.
[Archibald86] J. Archibald and J. L. Baer, "Cache Coherence Protocols: Evaluation
Using a Multiprocessor Simulation Model", ACM Trans. Computer Systems,
4(4):273-298, Nov. 1986.
[Berntsen90] J. Berntsen, "Communication-Efficient Matrix Multiplication on Hypercubes",
Parallel Computing, pp. 335-342, 1990.
[Bic88] L. Bic and A. C. Shaw, The Logical Design of Operating Systems, 2nd ed.,
Prentice-Hall, Englewood Cliffs, NJ, 1988.
[Cannon69] L. E. Cannon, A Cellular Computer to Implement the Kalman Filter
Algorithm, Ph.D. thesis, Montana State University, 1969.
[Caswell90] D. Caswell and D. Black, "Implementing a Mach Debugger for Multithreaded
Applications", Proc. Winter 1990 USENIX Conf., Washington, DC, Jan. 1990.
[Chan86] T. F. Chan and Y. Saad, "Multigrid Algorithms on the Hypercube Multiprocessor",
IEEE Trans. Computers, 35(11):969-977, 1986.
[Dahlquist74] G. Dahlquist and Å. Björck, Numerical Methods, Prentice-Hall, Englewood
Cliffs, NJ, 1974.
[Dongarra79] J. J. Dongarra et al., LINPACK Users' Guide, SIAM, Philadelphia, 1979.
[Dongarra92] J. Dongarra, "Performance of Various Computers Using Standard Linear
Equations Software", Technical report, Computer Science Department, University
of Tennessee, Knoxville, TN, 1992.
[Dubois88] M. Dubois, C. Scheurich, and F. A. Briggs, "Synchronization, Coherence
and Event Ordering in Multiprocessors", IEEE Computer, 21(2), 1988.
[Fox87] G. C. Fox, S. W. Otto, and A. J. Hey, "Matrix Algorithms on Hypercube (I):
Matrix Multiplication", Parallel Computing, pp. 17-31, 1987.
[Furtney92] M. Furtney, "Parallel Processing at Cray Research, Inc.", in R. H. Perrott
(ed.), Software for Parallel Computers, pp. 133-154, Chapman & Hall, 1992.
[Gharachorloo91] K. Gharachorloo, A. Gupta, and J. Hennessy, "Performance Evaluation
of Memory Consistency Models for Shared-Memory Multiprocessors", Proc.
Fourth Int. Conf. Arch. Support for Prog. Lang. and OS, 1991.
[Golub89] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., The
Johns Hopkins University Press, 1989.
[Hossfeld89] F. Hossfeld, R. Knecht, and W. E. Nagel, "Multitasking: Experience with
Applications on a Cray X-MP", Parallel Computing, 12:259-283, 1989.
[Hwang84] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing,
McGraw-Hill, New York, 1984.
[Hwang91] K. Hwang and C. M. Cheng, "Simulated Performance of a RISC-Based
Multiprocessor Using Orthogonal Access Memory", J. Parallel Distrib. Computing,
13:43-57, 1991.
[JaJa92] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, Reading,
MA, 1992.
[Johnsson89] S. L. Johnsson and C. T. Ho, "Optimal Broadcasting and Personalized
Communication in Hypercubes", IEEE Trans. Computers, 38(9):1249-1268, Sept.
1989.
[Konicek91] J. Konicek et al., "The Organization of the Cedar System", Proc. Int.
Conf. Parallel Processing, volume 1, pp. 49-56, 1991.
[Leighton92] F. T. Leighton, Introduction to Parallel Algorithms and Architectures,
Morgan Kaufmann, 1992.
[Li89] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems",
ACM Trans. Computer Systems, pp. 321-359, Nov. 1989.
[Mosberger93] D. Mosberger, "Memory Consistency Models", Operating Systems Review,
27(1):18-26, Jan. 1993.
[Quinn87] M. J. Quinn, Designing Efficient Algorithms for Parallel Computers,
McGraw-Hill, New York, 1987.
[Saavedra90] R. H. Saavedra and D. E. Culler, "Analysis of Multithreaded Architectures
for Parallel Computing", Proc. ACM Symp. Parallel Algorithms and Architectures,
Greece, July 1990.
[Silberschatz88] A. Silberschatz and J. Peterson, Operating System Concepts, Alternate
Edition, Addison-Wesley, Reading, MA, 1988.
[Stone90] H. S. Stone, High-Performance Computer Architecture, Addison-Wesley,
Reading, MA, 1990.
[Tanenbaum92] A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal, "Parallel Programming
Using Shared Objects and Broadcasting", IEEE Computer, 25(8):10-20, 1992.
[Wang89] J. Wang et al., "On the Communication Structures of Hyper-Ring and Hypercube
Multicomputers", J. Computer Sci. Tech., 4(1), Jan. 1989.
[Wolfe89] M. J. Wolfe, “Automatic Vectorization, Data Dependence, and Optimizations
for Parallel Computers”, in Hwang and DeGroot (eds.), Parallel Processing for
Supercomputing and Artificial Intelligence, Chapter 11, McGraw-Hill, New York,
1989.
[Yang89] Q. Yang, L. N. Bhuyan, and B. Liu, "Analysis and Comparison of Cache Co-
herence Protocols for a Packet-Switched Multiprocessor”, IEEE Trans. Computers,
38(8):1143-1153, Aug. 1989.
[Young87] M. W. Young, A. Tevanian, R. F. Rashid, D. B. Golub, J. Eppinger, J. Chew,
W. Bolosky, D. L. Black, and R. Baron, "The Duality of Memory and Communication
in the Implementation of a Multiprocessor Operating System", Proc. 11th
ACM Symp. Operating System Principles, pp. 63-76, 1987.