Basic Matrix Operations On A DSP Array Architecture: September 2000
2 authors, including:
Lars Bengtsson
Chalmers University of Technology
All content following this page was uploaded by Lars Bengtsson on 02 May 2019.
The major computational requirements for many real-time processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives [1]. This set includes matrix/vector multiplication, matrix/matrix multiplication and addition, matrix inversion, solution of linear systems, eigensystem solution, matrix decomposition (LU, QR and singular value decomposition) and the generalized SVD algorithm.

This paper presents the way in which such matrix computations are performed on a parallel DSP array architecture. Apart from complex operations such as matrix inversion, matrix/vector multiplication and solving a system of linear equations, simpler operations such as matrix addition/subtraction and matrix transposition are presented.

2 The “REMAP-γ” DSP array architecture

Figure 1 shows the array architecture. Based on the SIMD parallel computing model, a central Control Unit (CU) generates and issues instructions (control, address and timing) to the processors (PEs) in the array. All PEs receive the same instruction and perform the same operation (a “minor” local control modification is possible). In its classical definition, the model is fully synchronous.

Figure 1. The SIMD array architecture.

2.1 Processing Elements

The processing elements used in the architecture use a bit-serial data path including a (bit-parallel) register file. Each PE can address the registers in the file independently of other PEs using an index register, X. The global CU supplies the base address (common to all PEs), and each PE adds its own 8-bit offset (in the ACCUMULATOR) to this base, yielding the final address. This facilitates “local address modification”, useful in, e.g., table lookup operations for non-linear functions.

Figure 2 shows the PE data path architecture. The central part of this is a bit-serial ALU with two 1-bit input buses (A and B), a carry-in input and two 1-bit outputs (S and COUT). S is fed to an accumulator (shift register), to a 1-bit register (R) and to the index register. COUT (carry out) is fed to a 1-bit register (C) for carry-save. All data transport through the PE flows through this ALU. The data may be modified by the ALU or simply pass unaffected with the same time delay. A T-bit is used for “tagged” (selective) memory writes.

Figure 2. The PE data path architecture. (Blocks shown: Register File (256*16), X-Reg, ALU, C, R, T, PS-Register, Q_Register, 2*16 bit ACCUMULATOR, 16 bit Serial/Parallel Multiplier, connections to the N, E, W, S neighbours, the Global Control Unit and the local CU.)

Each PE has a dedicated 16-bit, two’s-complement, serial/parallel multiplier that makes multiplication in essence as fast (per bit) as addition/subtraction. It is fed by 16-bit parallel data (from the Q register) and by serial input data (from the PS register), least significant bit first. The output data is generated serially, least significant bit first. Figure 3 shows the multiplier structure.

Figure 3. The 16-bit two’s-complement serial/parallel multiplier.

Communication with the neighbouring PEs takes place through four wires input to the ALU input bus B (North, East, South, West) and output through a multiplexer where one of three sources is selected. The first source is the ACCUMULATOR and PS registers. This is used, e.g., in the NORTH, SOUTH, EAST and WEST instructions, where data are shifted one PE step in the array. The third source (the 1-bit register R) is the source in those instructions where data are passed through the PE on a bit-by-bit basis. Examples of instructions that use this are RBROADC (Row BROADCast) and RMAC (Row Multiply-and-ACcumulate). These instructions are described later in the text.

Input to the array is read from the north side. On the northern PE border, the N(orth) inputs are connected to parallel/serial conversion registers. In this way, the SOUTH instruction will push input data (16 bits) into the array from the north side. The SOUTH instruction also simultaneously outputs the rightmost PE column to the output interfaces. In this way, array input and output are performed concurrently.

Output from the array is also done by the EAST instruction, which shifts (“pops”) the PE accumulator contents one PE step to the east. The rightmost PEs (on the eastern border) have their outputs connected to serial/parallel conversion registers, which will serially be filled with 16-bit data. These registers are readable/writeable by the global CU while the array is working.

2.2 The instruction set

This section describes the REMAP-γ instruction set. Instructions from the global CU are issued at the word data-level. Most instructions operate on 16-bit words. However, some generate 32 bits (e.g. the RMAC and CMAC instructions). The CU sends these instructions at “low” speed (typ. 400/16 = 25 MHz), and they are interpreted and executed by the local CUs at “high” speed (typ. 400 MHz) at the bit level. Table 1 below shows the 34 instructions and their functions (the arithmetic instructions use two’s-complement data values).

Table 1. The Instruction Set

ADD (SUB)
    Add (subtract) register PS to (from) the ACC (16 bits).
MAC
    Multiply the Q register with the PS register and accumulate in ACC. Result in ACC (32 bits).
RBROADC (pe, REG)
    The PE in column # ‘pe’ broadcasts its ‘REG’ (ACC high word, or PS) in each row.
CBROADC (pe, REG)
    The PE in row # ‘pe’ broadcasts its ‘REG’ (ACC high word, or PS) in each column.
RMAC
    Multiply-and-ACcumulate each row. Result in ACC (32 bits) of the PEs in the rightmost column.
CMAC
    Multiply-and-ACcumulate each column. Result in ACC (32 bits) of the PEs in the lowest row.
RBROADC_DIAG (REG)
    The PEs on the diagonal broadcast their ‘REG’ (ACC high word, or PS) in each row.
CBROADC_DIAG (REG)
    The PEs on the diagonal broadcast their ‘REG’ (ACC high word, or PS) in each column.
RMIN/RMAX
    Find the min./max. value (in ACC) across each row. Result in ACC of the rightmost column.
CMIN/CMAX
    Find the min./max. value (in ACC) across each column. Result in ACC of the lowest row.
Table 1. The Instruction Set (continued)

ADDX
    Add ACC (bits 23-16) to register X. Result in X.
NORTH (REG), SOUTH (REG), EAST (REG), WEST (REG)
    Shift ‘REG’ (ACC32, ACC16 or PS) to the next PE in the array.
STORE (ACC_HIGH, ACC_LOW)
    Write the ACC (high word or low word) to local memory (address in X).
STORE_T (ACC_HIGH, ACC_LOW)
    Write the ACC to local memory if the T-bit is set.
R_SELECTF
    Searches the rows of PEs for T-bits set. Clears all but the first one, starting from the rightmost column.
C_SELECTF
    Searches the columns of PEs for T-bits set. Clears all but the first one, starting from the bottom row.
CLEAR
    Clear ACC, C, R, T and the multiplier S- and C-bits.
INIT
    Initialize the ROW and COL registers in each PE.
AND (OR)
    Logical AND (OR) of the PS and ACC registers (16 bits).
LOAD_X
    Load the registers with data from the instruction parameter field (8 bits).
LOAD_PS, LOAD_Q
    Load the PS (Q) registers with data from the local memory.
SHIFTACC (n)
    Shift the ACC register n steps to the right.

... time required is 30 cycles.

When the complete broadcast word has been collected in the respective PE, it raises its PE_READY signal, informing the global CU that the instruction has been completed in this PE.

The broadcast instructions utilize a “bit-level pipelined” approach to distribute data amongst rows and columns. Because only short nearest-neighbour connections are used, scalability is maintained when porting the architecture to future down-scaled deep-submicron CMOS IC processes [2].

2.2.2 The Multiply-and-ACcumulate instructions:

The Multiply-and-Accumulate instructions implement MAC operations either locally (MAC), for each row (RMAC) or for each column (CMAC). The RMAC and CMAC instructions use the same “bit-level pipelined” approach as the broadcast instructions. The principle is the same as described in Figure 5, with the addition that the bits are produced by the multipliers and accumulated along the way by the ALUs. Figure 6 illustrates this scheme.

Figure 6. (Labels shown: “RMAC instruction starts in this column”, “RMAC instruction ends in this column”, “bit-serial data flow”, “output vector”.)
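The bit-serial accumulation used by RMAC can be illustrated with a small software model. The sketch below is only an approximation under stated assumptions: the function names (`to_bits`, `from_bits`, `serial_add`, `rmac_row`) are hypothetical, the product is computed with an ordinary integer multiply standing in for the serial/parallel multiplier, and the cycle-level pipelining between neighbouring PEs is ignored. It shows only how LSB-first bit streams can be summed along a row with a one-bit carry register per adder, giving a 32-bit two's-complement result.

```python
def to_bits(value, width):
    """Return the two's-complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Interpret an LSB-first bit list as a two's-complement integer."""
    raw = sum(bit << i for i, bit in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw


def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams one bit per step.

    The single carry variable plays the role of the 1-bit C register.
    The result wraps modulo 2**width, as a fixed-width adder would.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return out


def rmac_row(q_row, ps_row, width=32):
    """Model RMAC for one row: accumulate Q*PS from west to east.

    Assumption: q * ps stands in for the serial/parallel multiplier output.
    The returned value is what would end up in the rightmost PE of the row.
    """
    partial = to_bits(0, width)
    for q, ps in zip(q_row, ps_row):
        product = to_bits(q * ps, width)
        partial = serial_add(partial, product)
    return from_bits(partial)


print(rmac_row([3, -2, 5], [4, 7, -1]))  # 12 - 14 - 5 = -7
```

The 32-bit width matters here: two 16-bit operands can produce a product that does not fit in 16 bits, which is consistent with Table 1 listing the MAC results as 32 bits.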
The A matrix is loaded in the first three columns of the array. The unity (identity) matrix is loaded in the last three columns. The elimination is now performed as in section 3.4.2, with the exception that three columns in the B matrix are handled instead of only one.

The number of clock cycles required to invert a 32*32 matrix using a 32*64 array is 53771 cycles. At 100 MHz clock frequency this corresponds to 538 µsec.

Mapping the algorithm this way means that the array is not square. If this is not desirable, a square array can be used if the B matrix is stored in the same PEs as the A matrix (in different memory positions). Of course, this slows down the algorithm execution, because the A and B elements cannot in this case be calculated (adjusted) in parallel.

Figure 7. Matrix/Vector multiplication using column broadcast and RMAC.

In this method a new matrix-by-vector multiplication is not started until the previous one has completed. The main benefit is the very low latency (see Table 2) for a specific vector, measured as the time delay between input and output.
The procedure is as follows: the input vector is shifted down (SOUTH) one PE step (this also inputs the next vector at the same time). The CBROADC_DIAG instruction is then used to broadcast the vectors on the respective columns using the diagonal PE elements as

[Figure (caption not recovered): input vectors enter the array through a skew section and output vectors leave the array through a deskew section. The PEs holding a11 to a33 accumulate partial products of the form a_i1 ⋅ f1 + a_i2 ⋅ f2 + a_i3 ⋅ f3 for successive input vectors.]
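The low-latency matrix/vector scheme described above (broadcast the vector along the columns, then multiply and accumulate along the rows) can be modelled at the word level. This is a minimal sketch, not the authors' microcode: the function names are hypothetical, and it assumes that vector element j is already held by the diagonal PE (j, j) before the broadcast, and that each PE (i, j) holds the matrix element a_ij in its Q register.

```python
def cbroadc_diag(diag_values):
    """Model CBROADC_DIAG: each diagonal PE sends its value down its column.

    Returns the PS contents of every PE as a list of rows, so every row
    ends up holding a full copy of the vector.
    """
    n = len(diag_values)
    return [[diag_values[col] for col in range(n)] for _ in range(n)]


def rmac(q, ps):
    """Model RMAC at the word level: multiply Q by PS in each PE and sum per row.

    The list returned corresponds to the ACC values of the rightmost column.
    """
    return [sum(q_val * ps_val for q_val, ps_val in zip(q_row, ps_row))
            for q_row, ps_row in zip(q, ps)]


def matvec(a, v):
    """Matrix/vector product as one column broadcast followed by one RMAC."""
    ps = cbroadc_diag(v)   # every PE (i, j) now holds v[j]
    return rmac(a, ps)     # row i accumulates sum_j a[i][j] * v[j]


A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
v = [1, 0, -1]
print(matvec(A, v))  # [-2, -2, -2]
```

In this model the whole product takes one broadcast step and one row accumulation, which matches the description of the method as processing one vector at a time with low latency.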
[Figure (caption not recovered). Labels shown: Parallel IN, µPI_IN (Parallel to Serial interface), µPI_OUT (Serial to Parallel interface), DATA_OUT, ‘0’ = bypass, PE11 to PE14.]

6 References

[1] J.H. Moreno, T. Lang, Matrix Computations on Systolic-Type Arrays, Kluwer Acad. Publ., ISBN 0-7923-9237-X, 1992.

[2] L. Bengtsson, “REMAP-γ: A Scalable SIMD VLSI Architecture with Hierarchical Control”, PhD thesis no. 320, School of Electrical and Computer