

BASIC MATRIX OPERATIONS ON A DSP ARRAY ARCHITECTURE
LARS BENGTSSON, STEFAN LUND
Halmstad University, Centre for Computer Systems Architecture,
Box 823, S-301 18 Halmstad, Sweden
[email protected], [email protected]

Abstract. Many processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives, e.g. matrix/vector multiplication, matrix transposition and inversion, and solving systems of equations. A highly parallel array architecture for such applications is presented, and it is shown how some frequently used matrix operations are performed. The array, consisting of PEs interconnected as a 2D grid, executes instructions according to the SIMD (Single Instruction Multiple Data) parallel computing style. It is scalable, both in terms of problem size and when porting it to future down-scaled CMOS processes.

Keywords: Parallel DSP, Multi-channel signal processing, Matrix computations, Architecture and implementation.
1 Introduction

The major computational requirements for many real-time processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives [1]. This set includes matrix/vector multiplication, matrix/matrix multiplication and addition, matrix inversion, solution of linear systems, eigensystem solution, matrix decomposition (LU, QR and singular value decomposition) and the generalized SVD algorithm.

This paper presents the way in which such matrix computations are performed on a parallel DSP array architecture. Apart from complex operations such as matrix inversion, matrix/vector multiplication and solving a system of linear equations, simpler operations, such as matrix addition/subtraction and matrix transposition, are presented.
2 The "REMAP-γ" DSP array architecture

Figure 1 shows the array architecture. Based on the SIMD parallel computing model, a central Control Unit (CU) generates and issues instructions (control, address and timing) to the processors (PEs) in the array. All PEs receive the same instruction and perform the same operation (a "minor" local control modification is possible). In its classical definition, the model is fully synchronous, and the timing and synchronization are given by the CU broadcasting a central global clock signal. A status signal "array status" (typically the "OR" sum of one status bit per PE) is read by the CU to monitor the state of the array.

The PEs, interconnected as a 2D array with nearest-neighbour connections, are optimized to perform MAC operations (fixed-point representation). Input data is received as a vector, one vector item per array column, at the array top border. After completed processing, an output vector is produced at the eastern array border.

[Figure omitted: the PE array, with 16-bit external input entering through parallel-to-serial registers at the north border, 16-bit external output leaving through serial-to-parallel registers at the east border, and the Control Unit (CU) supplying instructions and reading the "array status" signal.]

Figure 1. The SIMD array architecture.

2.1 Processing Elements

The processing elements used in the architecture use a bit-serial data path including a (bit-parallel) register file. Each PE can address the registers in the file independently of other PEs using an index register, X. The global CU supplies the base address (common to all PEs), and each PE adds its own 8-bit offset (in the ACCUMULATOR) to this base, yielding the final address. This facilitates "local address modification", useful in table lookup operations (e.g. for non-linear functions).
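To make the addressing concrete, the following C sketch models how a PE could form its effective register-file address. This is our illustration, not the paper's implementation; the names pe, cu_base and acc_offset are hypothetical.

#include <stdint.h>

#define REGFILE_WORDS 256                /* 256 x 16-bit register file per PE */

/* Hypothetical per-PE state; field names are ours, not the paper's. */
struct pe {
    uint16_t regfile[REGFILE_WORDS];
    uint8_t  acc_offset;                 /* 8-bit offset held in the ACCUMULATOR */
};

/* The CU broadcasts one base address to all PEs; each PE adds its own
   offset, so one SIMD instruction can read a different table entry in
   every PE ("local address modification"). */
static uint16_t pe_read(const struct pe *p, uint8_t cu_base)
{
    uint8_t addr = (uint8_t)(cu_base + p->acc_offset);   /* wraps within the file */
    return p->regfile[addr];
}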
Figure 2 shows the PE data path architecture. The central path is a bit-serial ALU with two 1-bit input buses (A and B), a carry-in input and two 1-bit outputs (S and COUT). S is fed to an accumulator (shift register), to a 1-bit register (R) and to the index register. COUT (carry out) is fed to a 1-bit register (C) for carry-save. All data transport through the PE flows through this ALU. The data may be modified by the ALU or simply pass unaffected with the same time delay. A T-bit is used for "tagged" (selective) memory writes.

[Figure omitted: the PE data path, showing the bit-serial ALU with input buses A and B, the 256*16 register file with its 8-bit address, the X, PS, Q, R, C and T registers, the 2*16-bit ACCUMULATOR, the 16-bit serial/parallel multiplier, and the connections to the global CU and to the N, E, W and S neighbours.]

Figure 2. The PE data path architecture.
Each PE has a dedicated 16-bit, two's-complement, serial/parallel multiplier that makes multiplication in essence as fast (per bit) as addition/subtraction. It is fed by 16-bit parallel data (from the Q register) and by serial input data (from the PS register), least significant bit first. The output data is serially generated, least significant bit first. Figure 3 shows the multiplier structure.

[Figure omitted: the multiplier, a chain of AND gates and full adders (FA) with carry-save (S and C) registers, gated by the Q-register bits Q(0)...Q(15); serial in, serial out.]

Figure 3. The 16-bit two's-complement serial/parallel multiplier.
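As a sanity check of the serial/parallel scheme, this behavioural C model (ours, not the paper's gate-level design) reproduces the multiplier's input/output behaviour: the parallel operand sits in Q, the serial operand arrives LSB first (sign-extended beyond bit 15), and each cycle accumulates one shifted partial product.

#include <assert.h>
#include <stdint.h>

/* Behavioural model: 32 cycles, one serial input bit per cycle.
   Working modulo 2^32, the sign-extended shift-and-add sequence
   produces exactly the 32-bit two's-complement product. */
static int32_t serial_parallel_mul(int16_t q, int16_t x)
{
    uint32_t xs  = (uint32_t)(int32_t)x;        /* serial stream = sign extension of x */
    uint32_t acc = 0;
    for (int t = 0; t < 32; t++) {
        if ((xs >> t) & 1u)
            acc += (uint32_t)(int32_t)q << t;   /* one shifted partial product */
        /* bit t of acc is now final and could be shifted out serially */
    }
    return (int32_t)acc;
}

int main(void)
{
    assert(serial_parallel_mul(-1234, 567) == -1234 * 567);
    assert(serial_parallel_mul(32767, -32768) == 32767 * -32768);
    return 0;
}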
Communication with the neighbouring PEs takes place through four wires input to the ALU input bus B (North, East, South, West), and output through a multiplexer where one of three sources is selected. The first source is the ACCUMULATOR and PS registers. This is used e.g. in the NORTH, SOUTH, EAST and WEST instructions, where data are shifted one PE step in the array. The third source (the 1-bit register R) is the source in those instructions where data are passed through the PE on a bit-by-bit basis. Examples of instructions that use this are RBROADC (Row BROADCast) and RMAC (Row Multiply-and-ACcumulate). These instructions are described later in the text.

Input to the array is read from the north side. On the northern PE border, the N(orth) inputs are connected to parallel/serial conversion registers. In this way, the SOUTH instruction will push input data (16 bits) into the array from the north side. The SOUTH instruction also simultaneously outputs the rightmost PE column to the output interfaces. In this way, array input and output are performed concurrently.

Output from the array is also done by the EAST instruction, which shifts ("pops") the PE accumulator contents one PE step to the east. The rightmost PEs (on the eastern border) have their outputs connected to serial/parallel conversion registers, which will serially be filled with 16-bit data. These registers are readable/writeable by the global CU while the array is working.

2.2 The instruction set

This section describes the REMAP-γ instruction set. Instructions from the global CU are issued at the word data-level. Most instructions operate on 16-bit words. However, some generate 32 bits (e.g. the RMAC and CMAC instructions). The CU sends these instructions at "low" speed (typ. 400/16 MHz), and they are interpreted and executed by the local CUs at "high" speed (typ. 400 MHz) at the bit-level. Table 1 below shows the 34 instructions and their functions (the arithmetic instructions use two's-complement form data values).

Table 1. The Instruction Set

ADD (SUB): Add (subtract) register PS to (from) the ACC (16 bits).
MAC: Multiply the Q register with the PS register and accumulate in ACC. Result in ACC (32 bits).
RBROADC (pe, REG): The PE in column #'pe' broadcasts its 'REG' (ACC high word, or PS) in each row.
CBROADC (pe, REG): The PE in row #'pe' broadcasts its 'REG' (ACC high word, or PS) in each column.
RMAC: Multiply-and-ACcumulate each row. Result in ACC (32 bits) of the PEs in the rightmost column.
CMAC: Multiply-and-ACcumulate each column. Result in ACC (32 bits) of the PEs in the lowest row.
RBROADC_DIAG (REG): The PEs in the diagonal broadcast their 'REG' (ACC high word, or PS) in each row.
CBROADC_DIAG (REG): The PEs in the diagonal broadcast their 'REG' (ACC high word, or PS) in each column.
RMIN/RMAX: Find the min./max value (in ACC) across each row. Result in ACC of the rightmost column.
CMIN/CMAX: Find the min./max value (in ACC) across each column. Result in ACC of the lowest row.
ADDX: Add ACC (bits 23-16) to register X. Result in X.
NORTH (REG), SOUTH (REG), EAST (REG), WEST (REG): Shift 'REG' (ACC32, ACC16 or PS) to the next PE in the array.
STORE (ACC_HIGH, ACC_LOW): Write the ACC (high word or low word) to local memory (address in X).
STORE_T (ACC_HIGH, ACC_LOW): Write the ACC to local memory, if the T-bit is set.
R_SELECTF: Searches the rows of PEs for T-bits set. Clears all but the first one, starting from the rightmost column.
C_SELECTF: Searches the columns of PEs for T-bits set. Clears all but the first one, starting from the bottom row.
CLEAR: Clear ACC, C, R, T and the multiplier S- and C-bits.
INIT: Initialize the ROW and COL registers in each PE.
AND (OR): Logical AND (OR) of the PS and ACC registers (16 bits).
LOAD_X: Load the X registers with data from the instruction parameter field (8 bits).
LOAD_PS, LOAD_Q: Load the PS (Q) registers with data from the local memory.
SHIFTACC (n): Shift the ACC register n steps to the right.
SHIFTACC_LEFT (n): Shift the ACC register n steps to the left.
DIV_RESTORE: Used in the Divide procedure.
GRT: Sets the T-bit if the PS register is greater than (or equal to) the ACC register (high word).
LESS: Sets the T-bit if the PS register is less than the ACC register (high word).
2.2.1 Broadcast instructions

The broadcast instructions broadcast (using the PE nearest-neighbour connections only) the accumulator or PS register of one PE column (RBROADC) or one PE row (CBROADC) on each row or column, respectively. The 'pe' argument indicates which PE column (RBROADC) or PE row (CBROADC) is the source of the broadcast. The RBROADC_DIAG and CBROADC_DIAG instructions use the PE elements of the main diagonal as the source. Each PE uses the 'pe' argument (supplied in the instruction field) to determine whether it is the source or one of the destinations. The source PEs start broadcasting immediately, and the destination PEs wait the appropriate number of clock cycles (determined by the difference between 'pe' and the respective PE ROW or COL number) before starting to receive the 16-bit-long bitstream.

The maximum time for the broadcast instructions depends on the array size. If the array size is N*N, the time is (N-2)+16 cycles. For a typical size of 16*16 PEs, the time required is 30 cycles. When the complete broadcasted word has been collected in the respective PE, it raises its PE_READY signal, informing the global CU that the instruction has been completed in this PE.

The broadcast instructions utilize a "bit-level pipelined" approach to distribute data amongst rows and columns. Using only short nearest-neighbour connections, scalability is maintained when porting the architecture to future down-scaled CMOS deep-submicron IC processes [2].

2.2.2 The Multiply-and-ACcumulate instructions

The Multiply-and-Accumulate instructions implement MAC operations either locally (MAC) or for each row (RMAC) or each column (CMAC). The RMAC and CMAC instructions use the same "bit-level pipelined" approach as the broadcast instructions. The principle is the same as described in Figure 5, with the addition that the bits are produced by the multipliers and accumulated along the way by the ALUs. Figure 6 illustrates this scheme.

[Figure omitted: a row of PEs with a bit-serial data flow moving east; the RMAC instruction starts in column 1 and ends in column N.]

Figure 6. Illustration of the RMAC instruction.

The time required for these instructions depends on the array size. If the array size is N*N, the time required for an RMAC or CMAC operation is N-1+log2(N)+32 cycles. For an array size of 16*16 PEs, this yields 51 clock cycles. For a 32*32 array, 68 cycles are required.
The maximum time for the broadcast instructions de-
pends on the array size. If the array size is N*N, the time
is (N-2)+16 cycles. For a typical size of 16*16 PEs, the
3 Basic matrix primitives

This section presents mappings and performance of some selected basic matrix operations on the architecture.

3.1 Multiplying a vector with the transpose of a matrix

In some algorithms (e.g. artificial neural networks, ANN), it is necessary to, after first doing the ordinary matrix-by-vector multiplication (R = WX), also multiply the result vector R with the transpose of the same matrix (W^T). (EQ 1) illustrates these calculations:

$r_i = \sum_j x_j w_{ij}$, $s_i = \sum_j r_j w_{ij}^T$ (EQ 1)

These operations are easily executed by the array using the RMAC and CMAC instructions, as shown in the following example. The ordinary matrix-vector multiplication (R = WX) is first calculated (assuming the X vector resides in the top row) using the RMAC instruction, after which the result vector R resides in the eastern PE column #N-1. The R vector is then broadcasted on the rows, and the CMAC instruction is used to create the result vector S.

CBROADC (0,PS)      /* broadcast the X vector on the columns */
RMAC                /* perform R = WX */
                    /* the accumulator is moved to the PS register */
RBROADC (N-1,PS)    /* broadcast the R vector from column #N-1 */
CMAC                /* MAC across the columns */

The S vector now resides in the southern (highest-numbered) row and may be moved to the eastern border, if needed, using only two further instructions (CBROADC (N-1), RBROADC_DIAG). Normally, however, as is the case in the ANN training algorithm "back-propagation", this operation is one part of a long sequence, and the result is not output but is used further in the calculations.
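For reference, the computation that the RMAC and CMAC steps carry out corresponds to the following sequential C model (ours, not array code):

#include <stdio.h>

#define N 3

/* Sequential reference model of EQ 1: r = W x, then s = W^T r.
   RMAC accumulates along rows (r); CMAC along columns (s). */
static void wx_then_wtr(const double w[N][N], const double x[N],
                        double r[N], double s[N])
{
    for (int i = 0; i < N; i++) {          /* r_i = sum_j x_j * w_ij   (RMAC) */
        r[i] = 0.0;
        for (int j = 0; j < N; j++)
            r[i] += x[j] * w[i][j];
    }
    for (int i = 0; i < N; i++) {          /* s_i = sum_j r_j * w_ji   (CMAC) */
        s[i] = 0.0;
        for (int j = 0; j < N; j++)
            s[i] += r[j] * w[j][i];
    }
}

int main(void)
{
    double w[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double x[N] = {1, 0, -1}, r[N], s[N];
    wx_then_wtr(w, x, r, s);
    for (int i = 0; i < N; i++)
        printf("r[%d] = %g, s[%d] = %g\n", i, r[i], i, s[i]);
    return 0;
}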
3.2 Vector transpose

Transposing a vector is straightforward and uses only two instructions (assuming the vector resides in the top row). First, the vector is broadcasted on the columns by the CBROADC instruction. Second, the RBROADC_DIAG instruction broadcasts the diagonal elements on the rows. The transposed vector now resides in the rightmost column. The instruction sequence for vector transpose is:

SOUTH (PS)          /* input the vector */
CBROADC (0,PS)      /* broadcast it on the columns */
RBROADC_DIAG (PS)   /* the diagonal PEs do a row broadcast */
EAST (PS)           /* output the vector */

3.3 Matrix addition and subtraction

Addition (or subtraction) is done in parallel using the ADD (SUB) instruction. If two matrices, A and B, are to be added, the instructions issued are (assuming that the A matrix elements reside in the ACCumulators):

LOAD_X (B)          /* load X register with address to B */
LOAD_PS             /* load B elements into PS registers */
ADD () (or SUB)     /* add/sub A and B items */
STORE (ACC_HIGH)    /* store result in B */
3.4 Matrix inversion

To invert a matrix A, the Gauss-Jordan elimination method may be used. This method is normally used to solve a linear system of equations. This section first describes how this solving can be done on the array. It is then shown how this can be extended to perform matrix inversion.

3.4.1 The Gauss-Jordan elimination method

This algorithm solves a system of linear equations with a variation of Gauss elimination. No back-substitution is necessary to complete the solving. All off-diagonal elements are eliminated. The resulting matrix and vector require only one division computation for each element to produce the solution vector. An example linear system of equations with three variables is shown below:

$$\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix} \quad \text{(EQ 2)}$$

The algorithm is as follows (showing the elimination of element A21):

• Calculate E = -A21/A11
• For J = 1 to 3, calculate A2J = A2J + A1J*E
• Calculate B2 = B2 + B1*E

This proceeds, one column at a time, until all off-diagonal elements have been eliminated. Finally, to obtain the solution vector:

• For i = 1 to 3, calculate Xi = Bi/Ai,i

The general algorithm (in its sequential form) can be expressed using the following pseudo-code:

FOR c IN 1 TO NO_COLS LOOP           /* for each column */
  FOR r IN 1 TO NO_ROWS LOOP         /* for each row */
    IF r=c CONTINUE;                 /* skip to next row if r=c */
    Er = -Arc/Acc;
    FOR j IN 1 TO NO_COLS LOOP       /* for each column in A */
      Arj = Arj + Acj*Er;
    ENDLOOP;
    Br = Br + Bc*Er;
  ENDLOOP;
ENDLOOP;
FOR r IN 1 TO NO_ROWS LOOP           /* for each row */
  Xr = Br/Ar,r
ENDLOOP

Parallelizing this sequential code yields the following pseudo-code:

FOR i IN 1 TO NO_COLS LOOP
  Broadcast row #i on the columns;
  Er = -Ari/Aii for each row r not equal to i; Ei = 0;
  Broadcast Er on the rows;
  Arc = Arc + Aic*Er;
  Br = Br + Bi*Er;
ENDLOOP

No row pivoting (row swapping) is used here, and thus cases in which a diagonal element is zero (Aii = 0) will produce a wrong result. Searching for the maximum element and row swapping may be added to cope with this situation.
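The sequential pseudo-code transcribes directly into C. The sketch below (ours) solves a small system; like the pseudo-code, it uses no pivoting, so a zero on the diagonal during elimination still breaks it.

#include <stdio.h>

#define N 3

/* Gauss-Jordan elimination without pivoting, as in the pseudo-code:
   eliminate all off-diagonal elements, then divide B by the diagonal. */
static void gauss_jordan(double a[N][N], double b[N], double x[N])
{
    for (int c = 0; c < N; c++) {            /* for each column */
        for (int r = 0; r < N; r++) {        /* for each row */
            if (r == c) continue;
            double e = -a[r][c] / a[c][c];   /* assumes a[c][c] != 0 */
            for (int j = 0; j < N; j++)
                a[r][j] += a[c][j] * e;
            b[r] += b[c] * e;
        }
    }
    for (int r = 0; r < N; r++)              /* the system is now diagonal */
        x[r] = b[r] / a[r][r];
}

int main(void)
{
    double a[N][N] = {{2,1,1},{1,3,2},{1,0,0}};
    double b[N] = {4, 5, 6}, x[N];
    gauss_jordan(a, b, x);
    printf("x = %g %g %g\n", x[0], x[1], x[2]);   /* solves A x = b */
    return 0;
}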
3.4.2 Executing Gauss-Jordan on the array

This section shows the code segment needed to execute the algorithm. The coefficient matrix A and result vector B are first loaded in the PE array by storing them in the PE local memories. The A matrix is stored in columns 0 to #NO_COLS-1, and the B vector is stored in the rightmost column (#NO_COLS). Also, a help matrix, H, is stored with an initial contents of:

$$H = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{bmatrix} \quad \text{(EQ 3)}$$

This help matrix is used to inhibit adjustment of row #i during the elimination process.

The elimination executed on the array is performed column by column. It starts from the leftmost column (column #0), eliminating all off-diagonal elements in this column, and proceeds towards the rightmost column. The instructions broadcast from the global CU are:

FOR i IN 0 TO NO_COLS-1 LOOP   /* for each column */
  Load_acc (A_B)
  CBROADC (ACC,i)         /* TEMPc <= Aic, or TEMPc <= Bi */
  Store_acc_hw (TEMP)
  Divide (A_B,TEMP)
  Store_acc_lw (E)
  Multiply (H,E)          /* clear all E values on row i */
  RBROADC (ACC,i)         /* broadcast E values on the rows */
  Store_acc_hw (E)        /* store Er <= Arc/TEMPc * Hrc */
  Multiply (TEMP,E)
  Store_acc_hw (SLASK)
  Load_acc (A_B)
  LOAD_X (SLASK)
  LOAD_PS
  SUB
  Store_acc_hw (A_B)      /* store Arc = Arc - TEMPc*Er, or Br = Br - TEMPc*Er */
  Load_acc (H)
  SOUTH (ACC16)           /* shift the help matrix down one PE step;
                             all "ones" are shifted in to the top row */
  Store_acc_hw (H)
ENDLOOP
Load_acc (A_B)
RBROADC_DIAG (ACC)        /* get the diagonal elements */
Store_acc_hw (SLASK)
Divide (A_B, SLASK)       /* the output vector is now in ACC (low word) in col. 3 */

The procedure (macro) "Multiply" above contains the following instructions:

Multiply (K,L):
  CLEAR
  LOAD_X (K)
  LOAD_Q
  LOAD_X (L)
  LOAD_PS
  MAC

The macros "Load_acc ()", "Divide ()", "Store_acc_hw ()" and "Store_acc_lw ()" load the ACC, divide Q by M, and store the ACC high word (bits 31-16, "hw") or the ACC low word (bits 15-0, "lw"), respectively.

Solving a 31*31 equation system (using a 32*32 array) requires 51040 cycles. At 100 MHz clock frequency this corresponds to 510 µs.
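One iteration of the CU loop can be modelled in C as follows (our sketch; TEMP and E follow the comments above, and the help matrix H is modelled by zeroing Er on the pivot row so that row i itself is not adjusted):

#include <stdio.h>

#define N 3                  /* A is N*N; column N of ab[] holds B */

/* One column step (pivot column c) of the array's Gauss-Jordan loop:
   TEMP <- row c broadcast down the columns, Er <- (Arc/TEMPc)*Hrc,
   then every element is adjusted: Arc <- Arc - TEMPc*Er. */
static void eliminate_column(double ab[N][N + 1], int c)
{
    double temp[N + 1], e[N];
    for (int j = 0; j <= N; j++)
        temp[j] = ab[c][j];               /* CBROADC of the pivot row */
    for (int r = 0; r < N; r++) {
        double h = (r == c) ? 0.0 : 1.0;  /* help matrix: zero on the pivot row */
        e[r] = ab[r][c] / temp[c] * h;    /* Divide + Multiply(H,E) + RBROADC */
    }
    for (int r = 0; r < N; r++)
        for (int j = 0; j <= N; j++)
            ab[r][j] -= temp[j] * e[r];   /* SUB: Arc = Arc - TEMPc*Er */
}

int main(void)
{
    double ab[N][N + 1] = {{2,1,1,4},{1,3,2,5},{1,0,0,6}};
    for (int c = 0; c < N; c++)
        eliminate_column(ab, c);
    for (int r = 0; r < N; r++)           /* final divides: x_r = B_r / A_rr */
        printf("x[%d] = %g\n", r, ab[r][N] / ab[r][r]);
    return 0;
}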
3.4.3 Matrix inversion using Gauss-Jordan elimination

To show that the Gauss-Jordan elimination method can perform matrix inversion, consider the following. First, use the relationship:

$$A A^{-1} = I \quad \text{(EQ 4)}$$

where I is the unity matrix. Then, use the notation:

$$A X = B \quad \text{(EQ 5)}$$

Identifying X and B in (EQ 4) yields:

$$X = A^{-1}, \quad B = I \quad \text{(EQ 6)}$$

Thus, by using Gauss-Jordan elimination with matrices A and B, where B is initialized as the unity matrix I, we can calculate the inverse matrix A^-1. As an example, to find the inverse of the 3*3 matrix A, we start with the following:

$$\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{(EQ 7)}$$

The A matrix is loaded in the first three columns of the array. The unity matrix is loaded in the last three columns. The elimination is now performed as in section 3.4.2, with the exception that three columns in the B matrix are handled instead of only one.

The number of clock cycles required to invert a 32*32 matrix using a 32*64 array is 53771 cycles. At 100 MHz clock frequency this corresponds to 538 µs. Mapping the algorithm this way means that the array is not quadratic. If this is not desirable, a quadratic array can be used if the B matrix is stored in the same PEs as the A matrix (in different memory positions). Of course, this slows down the algorithm execution, because the A and B elements can in this case not be calculated (adjusted) in parallel.
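In C, the same idea reads as follows (our sketch): run the elimination of section 3.4.1 on the augmented system [A | I], treating the N identity columns as N right-hand sides.

#include <stdio.h>

#define N 3

/* Invert A by Gauss-Jordan elimination on [A | I] (no pivoting). */
static void invert(const double a[N][N], double inv[N][N])
{
    double ab[N][2 * N];
    for (int r = 0; r < N; r++)
        for (int j = 0; j < N; j++) {
            ab[r][j]     = a[r][j];
            ab[r][N + j] = (r == j) ? 1.0 : 0.0;   /* B initialized to I */
        }
    for (int c = 0; c < N; c++)                    /* eliminate off-diagonals */
        for (int r = 0; r < N; r++) {
            if (r == c) continue;
            double e = -ab[r][c] / ab[c][c];
            for (int j = 0; j < 2 * N; j++)
                ab[r][j] += ab[c][j] * e;
        }
    for (int r = 0; r < N; r++)                    /* divide by the diagonal */
        for (int j = 0; j < N; j++)
            inv[r][j] = ab[r][N + j] / ab[r][r];
}

int main(void)
{
    const double a[N][N] = {{2,1,1},{1,3,2},{1,0,0}};
    double inv[N][N];
    invert(a, inv);
    for (int r = 0; r < N; r++)
        printf("%8.3f %8.3f %8.3f\n", inv[r][0], inv[r][1], inv[r][2]);
    return 0;
}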

3.5 Performing matrix-vector multiplication

Many signal processing algorithms can basically be formulated as a matrix-by-vector multiplication problem [3]. REMAP-γ can perform matrix/vector multiplication using three separate methods, each with different performance characteristics regarding throughput, latency and the amount of hardware needed. These are:

1. Using column broadcast and the RMAC instruction ("RMAC").
2. Using a systolic type of processing with the local MAC instruction and column broadcast ("pseudo-systolic 1").
3. Using a systolic type of processing with the local MAC instruction and nearest-neighbour PE communications only ("pseudo-systolic 2").

3.5.1 Matrix/vector multiplication using column broadcast and the RMAC instruction

The matrix/vector multiplication is basically performed by issuing two instructions, CBROADC (0) and RMAC (assuming the matrix is already loaded in the Q registers of the PEs). First, the vector is input to row #0. Second, this vector is broadcasted on the columns using the CBROADC (0) instruction. Third, the RMAC instruction performs the multiply-and-accumulate between the matrix and the vector elements. The result vector is found in the rightmost column. Figure 7 illustrates this with matrix A, input vector X and output vector O.

[Figure omitted: a 4*4 array holding elements a11...a44; the input vector x1...x4 is broadcast down the columns (CBROADC (0)) and the output vector o1...o4 is produced by RMAC in the rightmost column.]

Figure 7. Matrix/vector multiplication using column broadcast and RMAC.

In this method a new matrix-by-vector multiplication is not started until the previous one has completed. The main benefit is the very low latency (see Table 2) for a specific vector, measured as the time delay between input and output.

3.5.2 Matrix/vector multiplication using the local MAC instruction and column broadcast ("pseudo-systolic 1")

This method uses systolic-like processing where the input vectors are broadcasted on the columns and the accumulated sums are shifted east one PE step at a time through the array. The local MAC instruction produces the products and accumulates the sum. The result vector appears at the eastern output after some latency.

The procedure is as follows: the input vector is shifted down (SOUTH) one PE step (this also inputs the next vector at the same time). The CBROADC_DIAG instruction is then used to broadcast the vectors on the respective columns, using the diagonal PE elements as sources. Next, the MAC instruction produces the products and accumulates the sums. Finally, the accumulated sums are shifted one PE step to the east (which at the same time outputs a result vector).

Figure 8 shows this principle with matrix A and the input vector stream F, G, H, I, ...

In the systolic methods, where vectors are pipelined through the array, a new result is produced in each loop (after N initial loops). However, there is a "high" latency (equal to N loops) for a given vector.

[Figure omitted: a 3*3 array holding a11...a33; successive vectors f, g, h move down the columns while partial sums such as a11·f1 + a12·f2 + a13·f3 accumulate as they move east.]

Figure 8. "Pseudo-systolic 1" matrix/vector multiplication.
3.5.3 Using systolic processing and skew/deskew external registers ("pseudo-systolic 2")

This method uses systolic processing where the input vector is delayed (skewed) and the output vector is deskewed, according to Figure 10. The local MAC instruction is used to create the local product and add it to the accumulated sum shifted in from the left PE neighbour. The column broadcast is not necessary here, as it is in the first two methods.

This method yields the highest throughput but has a high latency (although not as high as "pseudo-systolic 1"). It requires extra hardware to skew/deskew the input and output vectors (shown in Figure 9). Each of these registers has a size equal to the PS (or Q) register in the PE datapath.

[Figure omitted: chains of delay registers (D) forming a skew section on the input vectors to the array and a deskew section on the output vectors from the array.]

Figure 9. Skew and deskew registers needed in the "pseudo-systolic 2" matrix/vector multiplication case.
Figure 10 shows how this matrix/vector multiplication is performed by the array. The matrix is stored in the PEs' Q registers, i.e. one matrix element in each PE. In each step, the vectors are shifted one step south, and the accumulated sums are shifted one step to the east. A result is produced in each step, and the resulting vector is found by deskewing the resulting vectors from the east edge PEs.

[Figure omitted: a 3*3 array with skewed input vectors entering from the north and skewed results leaving at the east edge, computing

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \cdot \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix} = \begin{bmatrix} a_{11} d_1 + a_{12} d_2 + a_{13} d_3 \\ a_{21} d_1 + a_{22} d_2 + a_{23} d_3 \\ a_{31} d_1 + a_{32} d_2 + a_{33} d_3 \end{bmatrix}$$

for each vector d in the input stream.]

Figure 10. "Pseudo-systolic 2" matrix/vector multiplication with skew/deskew of in- and out-data.
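The time-stepped C simulation below (ours, with illustrative sizes) mimics the "pseudo-systolic 2" data movement: column j of each input vector is delayed by j cycles (skew section), every PE does a local MAC while data moves south and sums move east, and row i of result k leaves the east edge at step k+i+N-1 (deskew section).

#include <stdio.h>

#define N 3            /* array size (N*N PEs) */
#define K 4            /* number of vectors pipelined through */
#define T (K + 2 * N)  /* enough steps to drain the pipeline */

int main(void)
{
    int a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};           /* one element per PE (Q reg) */
    int v[K][N] = {{1,0,0},{0,1,0},{0,0,1},{1,1,1}};   /* input vector stream */
    int d[N][N] = {0}, s[N][N] = {0};                  /* data moving south, sums moving east */
    int out[T][N];                                     /* raw (still skewed) east-edge output */

    for (int t = 0; t < T; t++) {
        for (int i = N - 1; i >= 0; i--)           /* update back-to-front so each PE */
            for (int j = N - 1; j >= 0; j--) {     /* reads its neighbours' old values */
                int din = (i > 0) ? d[i - 1][j]    /* skew: column j sees vector t-j */
                        : ((t - j >= 0 && t - j < K) ? v[t - j][j] : 0);
                int sin = (j > 0) ? s[i][j - 1] : 0;
                s[i][j] = sin + a[i][j] * din;     /* local MAC */
                d[i][j] = din;                     /* pass data south */
            }
        for (int i = 0; i < N; i++)
            out[t][i] = s[i][N - 1];
    }
    for (int k = 0; k < K; k++) {                  /* deskew the east-edge outputs */
        printf("A*v[%d] =", k);
        for (int i = 0; i < N; i++)
            printf(" %d", out[k + i + N - 1][i]);
        printf("\n");
    }
    return 0;
}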
Table 2 shows a comparison of the performance of the three matrix-vector multiplication methods in terms of sustained throughput and latency. The clock frequency assumed is 100 MHz and the array size is 32*32.

Table 2. Matrix/vector performance on a 32*32 array (@ 100 MHz)

Method              | Sustained throughput (GOPS) | Latency (clock cycles) | Latency (µs)
"RMAC"              | 1.4                         | 146                    | 1.46
"pseudo-systolic 1" | 1.6                         | 4032                   | 40
"pseudo-systolic 2" | 2.56                        | 2560                   | 25.6

As Table 2 reveals, the "pseudo-systolic 2" method is superior with respect to throughput (but requires extra skew and deskew registers), and the "RMAC" method is superior with respect to latency.
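The table's columns are mutually consistent: latency in µs is cycles divided by 100 MHz, and, counting 2*32*32 operations (one multiply and one add per matrix element) per product, the throughput figures imply the cycles per product recovered below. This arithmetic check is ours, not from the paper.

#include <stdio.h>

/* Consistency check of Table 2 on a 32*32 array at 100 MHz. */
int main(void)
{
    const double f_hz = 100e6, ops = 2.0 * 32 * 32;   /* ops per matrix-vector product */
    const char  *name[]    = { "RMAC", "pseudo-systolic 1", "pseudo-systolic 2" };
    const double gops[]    = { 1.4, 1.6, 2.56 };
    const double latency[] = { 146, 4032, 2560 };     /* clock cycles */

    for (int i = 0; i < 3; i++)
        printf("%-18s latency %7.2f us, ~%5.1f cycles/product sustained\n",
               name[i], latency[i] / f_hz * 1e6,
               ops / (gops[i] * 1e9) * f_hz);
    return 0;
}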
4 VLSI Test implementation

Two test prototype chips, one with 16 PEs (4*4) and one with 64 PEs (8*8), have been designed using VHDL synthesis and standard cells. The technology used was the ES2 0.7 micron N-well CMOS double-layer-metal process. The physical layout was created with Cadence place&route tools, which resulted in an array area of 225 mm2 (8*8 PEs) and a clock speed of 100 MHz.

The chip (4*4) block diagram is shown in Figure 11. As can be seen, the parallel-to-serial and serial-to-parallel conversion interfaces at the northern and eastern borders are included on-chip. These interfaces can, through the use of multiplexers, be bypassed on those PE chips that are not placed at the array borders when a multiple-chip array is constructed. The input interface (µPI_IN) includes four parallel-in/serial-out shift registers. The output interface (µPI_OUT) includes four serial-in/parallel-out shift registers.

[Figure omitted: the 4*4 test chip block diagram, with DATA_IN 1-4 feeding the µPI_IN (parallel-to-serial) interface, bypass multiplexers, the 4*4 PE grid (PE11...PE44), and the µPI_OUT (serial-to-parallel) interface driving DATA_OUT.]

Figure 11. Test chip (4*4) block diagram.

The 8*8 test chip has the same type of block diagram but has 64 PEs and eight input registers and eight output registers. Table 3 below summarizes the most important design parameters of the 8*8 test design.

Table 3. Summary of test chip (8*8) data

Chip parameter                              | Data
Technology                                  | 0.7 µm CMOS, double-layer metal
Clock frequency (MHz)                       | 100
Array area (mm2)                            | 225 (15 x 15)
Number of cells (excluding register memory) | 80450
Estimated power dissipation (W)             | < 12

Using scaling rules for CMOS [4] and [5], it is estimated that when using a state-of-the-art CMOS process (0.18 µm), approximately four times higher clock speed (400 MHz) and 16 (4^2) times smaller area should be expected. Thus, a 32*32 array would fit in one single chip.

5 Conclusions

This paper has shown the mapping and performance of basic matrix operations on a novel parallel DSP array architecture. These matrix operations, basic in many signal processing algorithms, include matrix inversion, matrix/vector multiplication, solving systems of linear equations, matrix addition/subtraction and matrix transposition. Performance figures were given and compared in terms of throughput, latency and execution times. Data for a VLSI test chip in 0.7 µm CMOS was presented, and estimations were shown regarding performance and chip size using a state-of-the-art process.
6 References

[1] J.H. Moreno, T. Lang, Matrix Computations on Systolic-Type Arrays, Kluwer Academic Publishers, ISBN 0-7923-9237-X, 1992.
[2] L. Bengtsson, "REMAP-γ: A Scalable SIMD VLSI Architecture with Hierarchical Control", PhD thesis no. 320, School of Electrical and Computer Engineering, Chalmers University of Technology, Gothenburg, Sweden, 1997.
[3] H.T. Kung, C.E. Leiserson, "Algorithms for VLSI Processor Arrays", in Introduction to VLSI Systems, Mead & Conway, Addison-Wesley, 1980.
[4] R.H. Dennard, F.H. Gaensslen, H.N. Yu, V.L. Rideout, E. Bassous, A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions", IEEE J. Solid-State Circuits, vol. SC-9, p. 256, 1974.
[5] K.C. Saraswat, F. Mohammadi, "Effect of Scaling of Interconnections on the Time Delay of VLSI Circuits", IEEE J. Solid-State Circuits, vol. SC-17, no. 2, pp. 275-280, April 1982.