Distributed Memory Machines
Arvind Krishnamurthy
Fall 2004

Examples: Intel Paragon, Cray T3E, IBM SP
Each processor is connected to its own memory and cache:
  cannot directly access another processor's memory
Each node has a network interface (NI) for all communication and synchronization
Key issues: design of the NI and of the interconnection topology
[Diagram: nodes P1, P2, ..., Pn, each with its own memory and network interface (NI), attached to an interconnect]
Historical Perspective
Early machines were:
Collection of microprocessors
bi-directional queues between neighbors
Messages were forwarded by processors on path
Strong emphasis on topology in algorithms
Network Analogy
To have a large number of transfers occurring at once, you
need a large number of distinct wires
Networks are like streets
link = street
switch = intersection
distances (hops) = number of blocks traveled
routing algorithm = travel plans
Important Properties:
latency: how long to get somewhere in the network
bandwidth: how much data can be moved per unit time
limited by the number of wires
and the rate at which each wire can accept data
Network Characteristics
Topology: how things are connected
  two types of nodes: hosts and switches
Routing algorithm: which paths are used
  e.g., all east-west then all north-south in a mesh
Switching strategy: how data in a message traverses a route
  circuit switching vs. packet switching
Flow control: what if there is congestion
  if two or more messages attempt to use the same channel, they may stall, move to buffers, reroute, be discarded, etc.
A packet consists of a header (routing and control information), the data payload, and a trailer (error code)
The bandwidth of a link is w * 1/t, where w is the number of wires and t is the time per bit
  effective bandwidth is lower due to packet overhead

Topology Properties
Question: what nice properties do we want the network topology to possess?
Routing distance: number of links on a route; minimize the average distance
Diameter: the maximum shortest path between two nodes
A network is partitioned if some nodes cannot reach others
Bisection bandwidth: sum of the bandwidths of the minimum set of channels which, if removed, partition the network
Linear and Ring Topologies
Linear array
  diameter is n-1, average distance ~n/3
  bisection bandwidth is 1
Torus or ring
  diameter is n/2, average distance ~n/4
  bisection bandwidth is 2
Used in algorithms with 1D arrays

Meshes and Tori
2D mesh
  diameter: 2√n
  bisection bandwidth: √n
Generalizes to 3D and higher dimensions
Cray T3D/T3E uses a 3D torus
Often easy to implement algorithms that use 2D-3D arrays
Hypercubes
Number of nodes n = 2^d for dimension d
  diameter: d
  bisection bandwidth: n/2
Popular in early machines (Intel iPSC, NCUBE)
  lots of clever algorithms
Gray-code addressing: each node is connected to d others that differ in 1 bit
[Diagram: 3-dimensional hypercube with nodes 000-111]

Trees
Diameter: log n
Bisection bandwidth: 1
Easy layout as a planar graph
Many tree algorithms (summation)
Fat trees avoid the bisection bandwidth problem
  more (or wider) links near the top
  example: Thinking Machines CM-5

Butterflies
[Diagram: butterfly building block]
Diameter: log n
Bisection bandwidth: n
Cost: lots of wires
Used in the BBN Butterfly
Natural for FFT
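A small sketch (my addition, not from the slides) of the Gray-code addressing property above: in a d-dimensional hypercube, a node's d neighbors are obtained by flipping one address bit at a time.

#include <stdio.h>

int main(void) {
    int d = 3;          /* dimension; n = 2^d = 8 nodes, labeled 000..111 */
    unsigned x = 5;     /* node 101 */
    for (int bit = 0; bit < d; bit++) {
        unsigned neighbor = x ^ (1u << bit);    /* flip exactly one bit */
        printf("node %u <-> node %u (differ in bit %d)\n", x, neighbor, bit);
    }
    return 0;
}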
Outline
Interconnection network issues:
Topology characteristics
Average routing distance
Diameter (maximum routing distance)
Bisection bandwidth
Link, switch design
Switching
Packet switching vs. circuit switching
Store-&-forward vs. cut-through routing
Routing
Link Design/Engineering Space
Cable of one or more wires/fibers with connectors at the ends, attached to switches or interfaces
  Short: single logical value at a time
  Long: stream of logical values at a time
  Narrow: control, data, and timing multiplexed on the wire
  Wide: control, data, and timing on separate wires
  Synchronous: source & dest on the same clock
  Asynchronous: source encodes clock in the signal

Switches
[Diagram: input ports feed receivers and input buffers, a cross-bar connects inputs to output buffers and transmitters on the output ports; control logic performs routing and scheduling]
Switch Components
Input ports
  synchronizer aligns the data signal with the local clock domain
  essentially a FIFO buffer
Output ports
  transmitter (typically drives clock and data)
Crossbar
  connects each input to any output
  degree limited by area or pinout
Buffering
Control logic
  complexity depends on the routing logic and scheduling algorithm
  determine the output port for each incoming packet
  arbitrate among inputs directed at the same output

Switching Strategies
Circuit switching: full path reserved for the entire message
  like the telephone
Packet switching: message broken into separately-routed packets
  like the post office
Question: what are the pros and cons of circuit switching & packet switching?
Store & forward vs. cut-through routing
[Diagram: timelines of a packet traveling from source to dest over several hops under store & forward routing vs. cut-through routing]
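The slide above compares the two strategies with a timing diagram; the sketch below (my addition, using the standard back-of-the-envelope formulas rather than anything on the slide) makes the same point numerically: store & forward pays the full packet time at every hop, while cut-through pays it essentially once.

#include <stdio.h>

int main(void) {
    double bytes = 1024, header = 8;   /* packet and header size in bytes */
    double bw = 100e6;                 /* link bandwidth, bytes/second    */
    int hops = 5;                      /* links on the route              */

    /* store & forward: the whole packet is retransmitted at every hop */
    double store_forward = hops * (bytes / bw);
    /* cut-through: only the header is delayed at each intermediate hop */
    double cut_through = bytes / bw + (hops - 1) * (header / bw);

    printf("store & forward: %.2f us\n", store_forward * 1e6);
    printf("cut-through    : %.2f us\n", cut_through * 1e6);
    return 0;
}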
Outline
Interconnection network issues:
Topology characteristics
Average routing distance
Diameter (maximum routing distance)
Bisection bandwidth
Switching
Packet switching vs. circuit switching
Store-&-forward vs. cut-through routing
Link, switch design
Routing
Routing
Interconnection network provides multiple paths between a
pair of source-dest nodes
Routing algorithm determines
which of the possible paths are used as routes
how the route is determined
Question: what desirable properties should the routing
algorithm have?
Routing Mechanism
Need to select the output port for each input packet
  in a few cycles
Simple arithmetic in regular topologies
  ex: Δx, Δy routing in a grid
  encode the distance to the destination in the header
    west (-x):  Δx < 0
    east (+x):  Δx > 0
    south (-y): Δx = 0, Δy < 0
    north (+y): Δx = 0, Δy > 0
    processor:  Δx = 0, Δy = 0
  reduce the relative address of each dimension in order
  dimension-order routing in k-ary meshes

Routing Mechanism (cont)
Source-based
  message header carries a series of port selects
  used and stripped en route
  variable sized packets: CRC? packet format?
  CS-2, Myrinet, MIT Arctic
Table-driven
  message header carries an index for the next port at the next switch: o = R[i]
  table also gives the index for the following hop: o, i' = R[i]
  ATM, HPPI
[Diagram: a packet traversing switches P0, P1, P2, P3]
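A minimal sketch (my addition) of the Δx, Δy port-selection rule from the slide above; the enum names and port encoding are illustrative, not from the slides.

#include <stdio.h>

enum port { WEST, EAST, SOUTH, NORTH, PROCESSOR };

/* Header carries the remaining offset (dx, dy); reduce x first, then y. */
enum port route_xy(int dx, int dy) {
    if (dx < 0) return WEST;
    if (dx > 0) return EAST;
    if (dy < 0) return SOUTH;
    if (dy > 0) return NORTH;
    return PROCESSOR;          /* dx == 0 && dy == 0: deliver locally */
}

int main(void) {
    /* routing from (1,1) to (3,0): dx = +2, dy = -1, so go east first */
    printf("%d\n", route_xy(3 - 1, 0 - 1));   /* prints 1 (EAST) */
    return 0;
}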
Properties of Routing Algorithms
Deterministic
  route determined by (source, dest), not intermediate state (i.e., traffic)
Adaptive
  route influenced by traffic along the way
Minimal
  only selects shortest paths
Deadlock free
  no traffic pattern can lead to a situation where packets cannot move forward
Question: how do we avoid deadlocks in a 2D mesh?

Deadlocks
How can a deadlock arise?
  necessary conditions:
    shared resource
    incrementally allocated
    non-preemptible
  think of a link/channel as a shared resource that is acquired incrementally
    source buffer then dest. buffer
    channels along a route
How do you avoid it?
  constrain how channel resources are allocated

Proof Technique
How do you prove that a routing algorithm is deadlock free?
  resources are logically associated with channels
  messages introduce dependences between resources as they move forward
  need to articulate the possible dependences that can arise between channels
  show that there are no cycles in the channel dependence graph
    find a numbering of channel resources such that every legal route follows a monotonically increasing sequence
    => no traffic pattern can lead to deadlock
Example: 2D array
Theorem: x,y routing is deadlock free
Numbering
  +x channel (i,y) -> (i+1,y) gets i
  -x channels are numbered in the reverse direction
  +y channel (x,j) -> (x,j+1) gets N+j
  -y channels are numbered in the reverse direction
Any routing sequence (x direction, turn, y direction) is increasing
The network need not be acyclic, only the channel dependence graph
[Diagram: 4x4 mesh with nodes 00-33 and numbered channels]

Channel Dependence Graph
[Diagram: channel dependence graph for x,y routing on the 4x4 mesh]
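A small sketch (my addition) of the numbering argument: it assigns a channel number to every hop of an x-then-y route and checks that the numbers increase. The slide only spells out the +x and +y rules; the reverse-direction numbering and the use of the per-dimension size as N are my reading.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4   /* N x N mesh, nodes (x,y) with 0 <= x,y < N */

/* Channel number for the hop (x,y) -> (nx,ny). */
static int channel_number(int x, int y, int nx, int ny) {
    if (nx == x + 1) return x;                 /* +x channel gets i          */
    if (nx == x - 1) return N - 2 - nx;        /* -x, numbered in reverse    */
    if (ny == y + 1) return N + y;             /* +y channel gets N + j      */
    if (ny == y - 1) return N + (N - 2 - ny);  /* -y, numbered in reverse    */
    abort();                                   /* not a legal mesh hop       */
}

int main(void) {
    int sx = 3, sy = 0, dx = 0, dy = 3;   /* route (3,0) -> (0,3), x first */
    int x = sx, y = sy, prev = -1;
    while (x != dx || y != dy) {
        int nx = x, ny = y;
        if (x != dx) nx += (dx > x) ? 1 : -1;   /* finish the x dimension   */
        else         ny += (dy > y) ? 1 : -1;   /* ... then the y dimension */
        int c = channel_number(x, y, nx, ny);
        printf("(%d,%d) -> (%d,%d): channel %d\n", x, y, nx, ny, c);
        assert(c > prev);                       /* strictly increasing      */
        prev = c; x = nx; y = ny;
    }
    return 0;
}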
Example: a message traveling from node 11 to node 12, then to node 22, and finally to node 32 obtains channels numbered 2, then 18, and then 19 - an increasing sequence.

Routing Deadlocks
If all turns are allowed, then channels are not obtained in increasing order
The channel dependence graph will have a cycle
  e.g., edges between 2 -> 17, 17 -> 1, 1 -> 18, and 18 -> 2
[Diagram: 4x4 mesh showing the channels involved in the cycle]
Question: what happens with a torus (or wraparound connections)?

Deadlock free wormhole networks
Basic dimension-order routing techniques don't work with wrap-around edges
How do we avoid deadlocks in such a situation?
Idea: add channels!

Breaking deadlock with virtual channels
Provide multiple virtual channels to break the dependence cycle
Do not need to add links or xbar, only buffer resources
  good for BW too!
A packet switches from the lo to the hi channel
  this adds nodes to the CDG (the previous scheme removed edges)
[Diagram: switch input and output ports connected through the cross-bar, with two virtual channels (buffers) per physical channel]
Turn Restrictions in X,Y
XY routing forbids 4 of the 8 turns and leaves no room for adaptive routing
Can you allow more turns and still be deadlock free?
[Diagram: the eight possible turns among +X, -X, +Y, -Y, with the four allowed by XY routing]

Minimal turn restrictions in 2D
  west-first
  north-last
  negative first
Example legal west-first routes
[Diagram: example west-first routes in a 2D mesh]
Adaptive Routing
R: C x N x Σ -> C
  (output channel chosen from the current channel, the destination node, and the network state)
Essential for fault tolerance
  can route around failures or congestion
Can improve utilization of the network
  simple deterministic algorithms easily run into bad permutations
Can combine turn restrictions with virtual channels
Choices: fully/partially adaptive, minimal/non-minimal
Can introduce complexity or anomalies
A little adaptation goes a long way!
Up*-Down* routing
Given any bi-directional network
Construct a spanning tree
Number the nodes increasing from leaves to root
  just a topological sort of the spanning tree
Up edge: any edge going from a lower-numbered node to a higher-numbered one
  down edges are the opposite
  not constrained to just using the spanning tree edges
Any source -> dest by an up*-down* route: up edges, single turn, down edges

Topology Summary

Topology     Degree      Diameter       Ave Dist     Bisection   D (D ave) @ P=1024
1D Array     2           N-1            N/3          1           huge
1D Ring      2           N/2            N/4          2
2D Mesh      4           2(√N - 1)      2/3 √N       √N          63 (21)
2D Torus     4           √N             1/2 √N       2√N         32 (16)
Butterfly    4           log N          log N        N           10 (10)
Hypercube    n = log N   n              n/2          N/2         10 (5)

Performance?
Some numberings and routes are much better than others
  interacts with topology in strange ways
n = 2 or n = 3
  short wires, easy to build
  many hops, low bisection bandwidth
n >= 4
  harder to build, more wires, longer average length
  fewer hops, better bisection bandwidth
Butterfly Network
Low diameter: O(log N)
Switches: 2 incoming links, 2 outgoing links
Processors: connected to the first and last levels
[Diagram: butterfly network with rows labeled 000-111]

Routing in Butterfly Network
Routes: a single path from a source to a destination
  deterministic
  non-adaptive
  can run into congestion
Routing algorithm: correct bits one at a time
  consider: 001 -> 111
[Diagram: the route from row 001 to row 111, correcting one bit per level]
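A small sketch (my addition) of the correct-bits-one-at-a-time rule above for the 001 -> 111 example; whether the high or low bit is corrected first depends on how the butterfly is drawn, so the MSB-first order here is an assumption.

#include <stdio.h>

int main(void) {
    int levels = 3;                 /* 2^3 = 8 rows, labeled 000..111 */
    unsigned src = 1, dst = 7;      /* the slide's example: 001 -> 111 */
    unsigned row = src;
    printf("level 0: row %u\n", row);
    for (int i = 0; i < levels; i++) {
        unsigned bit = 1u << (levels - 1 - i);   /* correct the MSB first   */
        row = (row & ~bit) | (dst & bit);        /* match one more dst bit  */
        printf("level %d: row %u\n", i + 1, row);
    }
    return 0;
}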
Congestion
Easy to have two routes share links
  consider: 001 -> 111 and 000 -> 011
[Diagram: the two routes overlapping on a middle link of the butterfly]

Congestion: worst case scenario
How bad can it get?
Consider a general butterfly with 2r = log N levels
Bit-reversal permutation: b1 b2 ... b2r-1 b2r -> b2r b2r-1 ... b2 b1
Consider just the following source-dest pairs:
  source: low-order r bits are zero, i.e., of the form b1 b2 ... br 0 0 ... 0
  dest: 0 0 ... 0 br br-1 ... b1
All of these must pass through 0 0 ... 0 after r routing steps
  (e.g., with 2r = 6: source 111000 -> dest 000111 is at 000000 after 3 steps)
How many such pairs exist?
  every combination of b1 b2 ... br
  number of combinations: 2^r = sqrt(2^2r) = sqrt(N)
Average Case Behavior: Butterfly Networks
Question:
  assume one packet from each source, and assume random destinations
  how many packets go through some intermediate switch at level k in the network (on average)?
Sources that could generate a message through that switch: 2^k
Number of possible destinations reachable from it: 2^(log N - k)
Expected congestion: 2^k * 2^(log N - k) / N = 1

Randomized Algorithm
How do we deal with bad permutations?
  turn them into two average-case behavior problems!
To route from source to dest:
  route from the source to a random node
  route from the random node to the destination
Turns the initial routing problem into two average-case permutations

Why Butterfly networks?
Bad permutations exist for all interconnection networks
Many networks perform well when you have locality or in the average case
Equivalence to hypercubes and fat-trees

Relationship Butterflies to Hypercubes
Wiring is isomorphic
  except that the Butterfly always takes log n steps

Fat Tree
[Diagram: fat tree with more (or wider) links near the top; wiring isomorphic to the butterfly]
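A quick Monte Carlo sanity check (my addition) of the expected-congestion claim on the average-case slide above: with one packet per source and uniformly random destinations, about one packet crosses any given level-k switch. The row/level conventions in the code are assumptions.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int logN = 10, N = 1 << logN, k = 4, trials = 1000;
    long total = 0;
    srand(1);
    for (int t = 0; t < trials; t++) {
        for (int s = 0; s < N; s++) {
            int d = rand() % N;   /* random destination for source s */
            /* row at level k: top k bits already from d, low bits still from s */
            int low = (1 << (logN - k)) - 1;
            int row = (d & ~low) | (s & low);
            if (row == 0) total++;   /* packet crosses the level-k switch in row 0 */
        }
    }
    printf("average packets through the level-%d switch in row 0: %.3f\n",
           k, (double)total / trials);    /* comes out close to 1 */
    return 0;
}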
de Bruijn Network
Each node has two outgoing links
Node x is connected to 2*x and 2*x + 1 (mod N)
Example:
  Node 000 is connected to Node 000 and Node 001
  Node 001 is connected to Node 010 and Node 011
How do we perform routing on such a network?
What is the diameter of this network?
[Diagram: 8-node de Bruijn network with nodes 000-111]

Summary
We covered:
  popular topologies
  routing issues
    cut-through / store-and-forward / packet-switching / circuit-switching
    deadlock-free routes:
      limit paths
      introduce virtual channels
  link/switch design issues
  some popular routing algorithms
From a software perspective:
  all that matters is that the interconnection network takes a chunk of bytes and communicates it to the target processor
  it would be useful to abstract the interconnection network into some useful performance metrics: latency and bandwidth
Linear Model of Communication Cost
How do you model and measure point-to-point communication performance?
A simple linear model:
  data transfer time = latency + message_size / bandwidth
  latency is the startup time, independent of message size
  bandwidth is the number of bytes per second
  mostly independent of source and destination!
  linear is often a good approximation; piecewise linear is sometimes better
The latency/bandwidth model helps understand performance:
  for short messages, latency dominates the transfer time
  for long messages, the bandwidth term dominates the transfer time
What are "short" and "long"?
  latency term = bandwidth term when latency = message_size / bandwidth
  critical message size = latency * bandwidth
  example: 50 us * 50 MB/s = 2500 bytes
    messages longer than 2500 bytes are bandwidth dominated
    messages shorter than 2500 bytes are latency dominated
But the linear model is not enough:
  when can the next transfer be initiated?
  can cost be overlapped?
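A small sketch (my addition) that just evaluates the linear model; the 50 us / 50 MB/s numbers are the slide's example, everything else is illustrative.

#include <stdio.h>

int main(void) {
    double latency = 50e-6;        /* 50 us startup cost   */
    double bandwidth = 50e6;       /* 50 MB/s              */
    double critical = latency * bandwidth;   /* 2500 bytes */
    printf("critical message size = %.0f bytes\n", critical);

    for (double size = 100; size <= 1e6; size *= 10) {
        double t = latency + size / bandwidth;   /* linear model */
        printf("%8.0f bytes: %8.1f us (%s dominated)\n", size, t * 1e6,
               size < critical ? "latency" : "bandwidth");
    }
    return 0;
}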
LogGP Model
L (latency): latency in sending a (small) message between modules
o (overhead): overhead felt by the processor on sending or receiving a message
g (gap): gap between successive sends or receives at a processor
G: gap between successive bytes of the same message
P: number of processors
Limited volume: at most L/g messages in flight to/from a processor
[Diagram: P processors attached to an interconnection network; each send pays overhead o, successive sends are separated by gap g, and the network adds latency L]

Using the Model
Time to send a large message: L + o + size * G
Time to send n small messages from one processor to another: L + o + (n-1)*g
  the processor has n*o cycles of overhead
  and (n-1)*(g-o) idle cycles that could be overlapped with other computation
[Diagram: timeline of n sends, each with overhead o, separated by gap g, arriving after latency L]
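A tiny sketch (my addition) that evaluates the two formulas from the slide; the parameter values are illustrative, not measurements.

#include <stdio.h>

int main(void) {
    double L = 10.0, o = 2.0, g = 4.0, G = 0.01;   /* microseconds (made up) */
    int n = 100;           /* number of small messages   */
    double size = 1e6;     /* bytes in one large message */

    double t_small = L + o + (n - 1) * g;   /* n small messages, per the slide */
    double t_large = L + o + size * G;      /* one large message               */
    printf("n small messages : %.1f us (%.1f us overhead, %.1f us idle)\n",
           t_small, n * o, (n - 1) * (g - o));
    printf("one large message: %.1f us\n", t_large);
    return 0;
}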
Some Typical LogGP values

Machine          L          o         g         G
CM5              20.5 us    5.9 us    8.3 us    0.007 us (140 MB/s)
T3D              16.5 us    6.0 us    6.2 us    0.125 us (8 MB/s)
Intel Paragon    0.85 us    0.40 us   0.40 us   0.007 us (140 MB/s)

Message Passing Programs
Separate processes, separate address spaces
Processes execute independently and concurrently
Processes transfer data cooperatively
General version: Multiple Program Multiple Data (MPMD)
Slightly constrained version: Single Program Multiple Data (SPMD)
  single code image running on different processors
  can execute independently (or asynchronously), take different branches for instance
MPI: the most popular message passing library
  an extended message-passing model
  not a language or compiler specification
  not a specific implementation or product
Hello World (Trivial)
To make this legal MPI, we need to add 2 lines (MPI_Init and MPI_Finalize):

#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    MPI_Init( &argc, &argv );
    printf( "Hello, world!\n" );
    MPI_Finalize();
    return 0;
}

Hello World (Independent Processes)
A simple, but not very interesting, SPMD program:

#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "I am %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

MPI Basic Send/Receive
We need to fill in the details in:
  Process 0: Send(data)  ->  Process 1: Receive(data)
Things that need specifying:
  how will processes be identified?
  how will data be described?
  how will the receiver recognize/screen messages?
  what will it mean for these operations to complete?
Processes belong to communicators (process groups)
  the default communicator is MPI_COMM_WORLD
  communicators have a size and define a rank for each member

Point-to-Point Example
Process 0 sends array A to process 1, which receives it as B.

Process 0:
    #define TAG 123
    double A[10];
    MPI_Send(A, 10, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);

Process 1:
    #define TAG 123
    double B[10];
    MPI_Recv(B, 10, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, &status);

  or, with wildcards:
    MPI_Recv(B, 10, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

status: useful for querying the tag and source after reception
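A complete runnable version of the example above (the MPI_Send/MPI_Recv calls are the slide's; the surrounding glue code is my addition). Compile with mpicc and run with at least 2 processes.

#include "mpi.h"
#include <stdio.h>
#define TAG 123

int main( int argc, char *argv[] )
{
    int rank;
    MPI_Status status;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if (rank == 0) {
        double A[10];
        for (int i = 0; i < 10; i++) A[i] = i;   /* something to send */
        MPI_Send(A, 10, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double B[10];
        MPI_Recv(B, 10, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, &status);
        printf( "received B[9] = %g from rank %d, tag %d\n",
                B[9], status.MPI_SOURCE, status.MPI_TAG );
    }
    MPI_Finalize();
    return 0;
}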
MPI Datatypes
The data in a message to be sent or received is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as:
  predefined, corresponding to a data type from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
  a contiguous array of MPI datatypes
  a strided block of datatypes
  an indexed array of blocks of datatypes
  an arbitrary structure of datatypes
Goal: support heterogeneous clusters
Specifying the layout in memory may improve performance:
  reduces memory-to-memory copies in the implementation
  allows the use of special hardware (scatter/gather) when available

Collective Communication in MPI
Collective operations are called by all processes in a communicator.
MPI_BCAST distributes data from one process to all others in a communicator:
    MPI_Bcast(start, count, datatype, source, comm);
MPI_REDUCE combines data from all processes in the communicator and returns it to one process:
    MPI_Reduce(in, out, count, datatype, operation, dest, comm);
For example:
    MPI_Reduce(&mysum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
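A runnable context (my scaffolding) around the slide's MPI_Reduce example: each process contributes a local value and rank 0 receives the sum, which is then broadcast back to everyone.

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    double mysum = rank;          /* each process's local contribution */
    double sum = 0.0;
    MPI_Reduce(&mysum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf( "sum of ranks 0..%d = %g\n", size - 1, sum );

    /* broadcast the result from rank 0 to all processes */
    MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}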
Non-blocking Operations
Split communication operations into two parts:
  the first part initiates the operation; it does not block
  the second part waits for the operation to complete

    MPI_Request request;

    MPI_Recv(buf, count, type, source, tag, comm, status)
      =
    MPI_Irecv(buf, count, type, source, tag, comm, &request)
      +
    MPI_Wait(&request, &status)

    MPI_Send(buf, count, type, dest, tag, comm)
      =
    MPI_Isend(buf, count, type, dest, tag, comm, &request)
      +
    MPI_Wait(&request, &status)

Using Non-blocking Receive
Two advantages:
  no deadlock (correctness)
  data may be transferred concurrently (performance)
Example:
  Process 0: Send(1); Recv(1)
  Process 1: Send(0); Recv(0)
  Process 0, non-blocking: Isend(1); compute; Wait()

Operations on MPI_Request
MPI_Wait(INOUT request, OUT status)
  waits for the operation to complete and returns info in status
  frees the request object (and sets it to MPI_REQUEST_NULL)
MPI_Test(INOUT request, OUT flag, OUT status)
  tests to see if the operation is complete and returns info in status
  frees the request object if complete
MPI_Request_free(INOUT request)
  frees the request object but does not wait for the operation to complete
MPI_Waitall(..., INOUT array_of_requests, ...)
MPI_Testall(..., INOUT array_of_requests, ...)
MPI_Waitany / MPI_Testany / MPI_Waitsome / MPI_Testsome

Non-Blocking Communication Gotchas
Obvious caveats:
  1. You may not modify the buffer between Isend() and the corresponding Wait(). Results are undefined.
  2. You may not look at or modify the buffer between Irecv() and the corresponding Wait(). Results are undefined.
  3. You may not have two pending Irecv()s for the same buffer.
Less obvious:
  4. You may not look at the buffer between Isend() and the corresponding Wait().
  5. You may not have two pending Isend()s for the same buffer.
Why the Isend() restrictions?
  Restrictions give implementations more freedom, e.g., on a heterogeneous computer with differing byte orders the implementation may swap bytes in the original buffer.
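A short sketch (my example, not from the slides) that follows the rules above: two processes exchange values with Isend/Irecv, leave the buffers alone while the operations are pending, and then wait on both requests - no deadlock regardless of message size.

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, other, sendbuf, recvbuf;
    MPI_Request reqs[2];
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    other = 1 - rank;                 /* assumes exactly 2 processes */
    sendbuf = rank * 100;

    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... overlap computation here; do not touch sendbuf or recvbuf ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf( "rank %d received %d\n", rank, recvbuf );
    MPI_Finalize();
    return 0;
}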