

BASIC MATRIX OPERATIONS ON A DSP ARRAY ARCHITECTURE
LARS BENGTSSON, STEFAN LUND
Halmstad University, Centre for Computer Systems Architecture,
Box 823, S-301 18 Halmstad, Sweden
[email protected], [email protected]

Abstract. Many processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives, e.g. matrix/vector multiplication, matrix transposition and inversion, and solving systems of equations. A highly parallel array architecture for such applications is presented, and it is shown how some frequently used matrix operations are performed. The array, consisting of PEs interconnected as a 2D grid, executes instructions according to the SIMD (Single Instruction Multiple Data) parallel computing style. It is scalable, both in terms of problem size and when porting it to future down-scaled CMOS processes.

Keywords: Parallel DSP, Multi-channel signal processing, Matrix computations, Architecture and implementation.
1 Introduction

The major computational requirements for many real-time processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives [1]. This set includes matrix/vector multiplication, matrix/matrix multiplication and addition, matrix inversion, solution of linear systems, eigensystem solution, matrix decomposition (LU, QR and singular value decomposition) and the generalized SVD algorithm.

This paper presents the way in which such matrix computations are performed on a parallel DSP array architecture. Apart from complex operations such as matrix inversion, matrix/vector multiplication and solving a system of linear equations, simpler operations, such as matrix addition/subtraction and matrix transposition, are presented.
2 The "REMAP-γ" DSP array architecture

Figure 1 shows the array architecture. Based on the SIMD parallel computing model, a central Control Unit (CU) generates and issues instructions (control, address and timing) to the processors (PEs) in the array. All PEs receive the same instruction and perform the same operation (a "minor" local control modification is possible). In its classical definition, the model is fully synchronous, and the timing and synchronization are given by the CU broadcasting a central global clock signal. A status signal "array status" (typically the "OR" sum of one status bit per PE) is read by the CU to monitor the state of the array.

The PEs, interconnected as a 2D array with nearest-neighbour connections, are optimized to perform MAC operations (fixed-point representation). Input data is received as a vector, one vector item per array column, at the array top border. After completed processing, an output vector is produced at the eastern array border.

[Figure omitted: the PE array, with 16-bit external input entering through parallel-to-serial registers at the north border, 16-bit external output leaving through serial-to-parallel registers at the east border, and the Control Unit (CU) supplying instructions and reading the "array status" signal.]

Figure 1. The SIMD array architecture.

2.1 Processing Elements

The processing elements used in the architecture use a bit-serial data path including a (bit-parallel) register file. Each PE can address the registers in the file independently of other PEs using an index register, X. The global CU supplies the base address (common to all PEs), and each PE adds its own 8-bit offset (in the ACCUMULATOR) to this base, yielding the final address. This facilitates "local address modification", useful in table lookup operations (e.g. for non-linear functions).
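To make the addressing concrete, the following C sketch models how a PE could form its effective register-file address. This is our illustration, not the paper's implementation; the names pe, cu_base and acc_offset are hypothetical.

#include <stdint.h>

#define REGFILE_WORDS 256                /* 256 x 16-bit register file per PE */

/* Hypothetical per-PE state; field names are ours, not the paper's. */
struct pe {
    uint16_t regfile[REGFILE_WORDS];
    uint8_t  acc_offset;                 /* 8-bit offset held in the ACCUMULATOR */
};

/* The CU broadcasts one base address to all PEs; each PE adds its own
   offset, so one SIMD instruction can read a different table entry in
   every PE ("local address modification"). */
static uint16_t pe_read(const struct pe *p, uint8_t cu_base)
{
    uint8_t addr = (uint8_t)(cu_base + p->acc_offset);   /* wraps within the file */
    return p->regfile[addr];
}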
Figure 2 shows the PE data path architecture. The central path is a bit-serial ALU with two 1-bit input buses (A and B), a carry-in input and two 1-bit outputs (S and COUT). S is fed to an accumulator (shift register), to a 1-bit register (R) and to the index register. COUT (carry out) is fed to a 1-bit register (C) for carry-save. All data transport through the PE flows through this ALU. The data may be modified by the ALU or simply pass unaffected with the same time delay. A T-bit is used for "tagged" (selective) memory writes.

[Figure omitted: the PE data path, showing the bit-serial ALU with input buses A and B, the 256*16 register file with its 8-bit address, the X, PS, Q, R, C and T registers, the 2*16-bit ACCUMULATOR, the 16-bit serial/parallel multiplier, and the connections to the global CU and to the N, E, W and S neighbours.]

Figure 2. The PE data path architecture.
Each PE has a dedicated 16-bit, two's-complement, serial/parallel multiplier that makes multiplication in essence as fast (per bit) as addition/subtraction. It is fed by 16-bit parallel data (from the Q register) and by serial input data (from the PS register), least significant bit first. The output data is serially generated, least significant bit first. Figure 3 shows the multiplier structure.

[Figure omitted: the multiplier, a chain of AND gates and full adders (FA) with carry-save (S and C) registers, gated by the Q-register bits Q(0)...Q(15); serial in, serial out.]

Figure 3. The 16-bit two's-complement serial/parallel multiplier.
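As a sanity check of the serial/parallel scheme, this behavioural C model (ours, not the paper's gate-level design) reproduces the multiplier's input/output behaviour: the parallel operand sits in Q, the serial operand arrives LSB first (sign-extended beyond bit 15), and each cycle accumulates one shifted partial product.

#include <assert.h>
#include <stdint.h>

/* Behavioural model: 32 cycles, one serial input bit per cycle.
   Working modulo 2^32, the sign-extended shift-and-add sequence
   produces exactly the 32-bit two's-complement product. */
static int32_t serial_parallel_mul(int16_t q, int16_t x)
{
    uint32_t xs  = (uint32_t)(int32_t)x;        /* serial stream = sign extension of x */
    uint32_t acc = 0;
    for (int t = 0; t < 32; t++) {
        if ((xs >> t) & 1u)
            acc += (uint32_t)(int32_t)q << t;   /* one shifted partial product */
        /* bit t of acc is now final and could be shifted out serially */
    }
    return (int32_t)acc;
}

int main(void)
{
    assert(serial_parallel_mul(-1234, 567) == -1234 * 567);
    assert(serial_parallel_mul(32767, -32768) == 32767 * -32768);
    return 0;
}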
Communication with the neighbouring PEs takes place through four wires input to the ALU input bus B (North, East, South, West), and output through a multiplexer where one of three sources is selected. The first source is the ACCUMULATOR and PS registers. This is used e.g. in the NORTH, SOUTH, EAST and WEST instructions, where data are shifted one PE step in the array. The third source (the 1-bit register R) is the source in those instructions where data are passed through the PE on a bit-by-bit basis. Examples of instructions that use this are RBROADC (Row BROADCast) and RMAC (Row Multiply-and-ACcumulate). These instructions are described later in the text.

Input to the array is read from the north side. On the northern PE border, the N(orth) inputs are connected to parallel/serial conversion registers. In this way, the SOUTH instruction will push input data (16 bits) into the array from the north side. The SOUTH instruction also simultaneously outputs the rightmost PE column to the output interfaces. In this way, array input and output are performed concurrently.

Output from the array is also done by the EAST instruction, which shifts ("pops") the PE accumulator contents one PE step to the east. The rightmost PEs (on the eastern border) have their outputs connected to serial/parallel conversion registers, which will serially be filled with 16-bit data. These registers are readable/writeable by the global CU while the array is working.

2.2 The instruction set

This section describes the REMAP-γ instruction set. Instructions from the global CU are issued at the word data-level. Most instructions operate on 16-bit words. However, some generate 32 bits (e.g. the RMAC and CMAC instructions). The CU sends these instructions at "low" speed (typ. 400/16 MHz), and they are interpreted and executed by the local CUs at "high" speed (typ. 400 MHz) at the bit-level. Table 1 below shows the 34 instructions and their functions (the arithmetic instructions use two's-complement form data values).

Table 1. The Instruction Set

ADD (SUB): Add (subtract) register PS to (from) the ACC (16 bits).
MAC: Multiply the Q register with the PS register and accumulate in ACC. Result in ACC (32 bits).
RBROADC (pe, REG): The PE in column #'pe' broadcasts its 'REG' (ACC high word, or PS) in each row.
CBROADC (pe, REG): The PE in row #'pe' broadcasts its 'REG' (ACC high word, or PS) in each column.
RMAC: Multiply-and-ACcumulate each row. Result in ACC (32 bits) of the PEs in the rightmost column.
CMAC: Multiply-and-ACcumulate each column. Result in ACC (32 bits) of the PEs in the lowest row.
RBROADC_DIAG (REG): The PEs in the diagonal broadcast their 'REG' (ACC high word, or PS) in each row.
CBROADC_DIAG (REG): The PEs in the diagonal broadcast their 'REG' (ACC high word, or PS) in each column.
RMIN/RMAX: Find the min./max value (in ACC) across each row. Result in ACC of the rightmost column.
CMIN/CMAX: Find the min./max value (in ACC) across each column. Result in ACC of the lowest row.
ADDX: Add ACC (bits 23-16) to register X. Result in X.
NORTH (REG), SOUTH (REG), EAST (REG), WEST (REG): Shift 'REG' (ACC32, ACC16 or PS) to the next PE in the array.
STORE (ACC_HIGH, ACC_LOW): Write the ACC (high word or low word) to local memory (address in X).
STORE_T (ACC_HIGH, ACC_LOW): Write the ACC to local memory, if the T-bit is set.
R_SELECTF: Searches the rows of PEs for T-bits set. Clears all but the first one, starting from the rightmost column.
C_SELECTF: Searches the columns of PEs for T-bits set. Clears all but the first one, starting from the bottom row.
CLEAR: Clear ACC, C, R, T and the multiplier S- and C-bits.
INIT: Initialize the ROW and COL registers in each PE.
AND (OR): Logical AND (OR) of the PS and ACC registers (16 bits).
LOAD_X: Load the X registers with data from the instruction parameter field (8 bits).
LOAD_PS, LOAD_Q: Load the PS (Q) registers with data from the local memory.
SHIFTACC (n): Shift the ACC register n steps to the right.
SHIFTACC_LEFT (n): Shift the ACC register n steps to the left.
DIV_RESTORE: Used in the Divide procedure.
GRT: Sets the T-bit if the PS register is greater than (or equal to) the ACC register (high word).
LESS: Sets the T-bit if the PS register is less than the ACC register (high word).
2.2.1 Broadcast instructions

The broadcast instructions broadcast (using the PE nearest-neighbour connections only) the accumulator or PS register of one PE column (RBROADC) or one PE row (CBROADC) on each row or column, respectively. The 'pe' argument indicates which PE column (RBROADC) or PE row (CBROADC) is the source of the broadcast. The RBROADC_DIAG and CBROADC_DIAG instructions use the PE elements of the main diagonal as the source. Each PE uses the 'pe' argument (supplied in the instruction field) to determine whether it is the source or one of the destinations. The source PEs start broadcasting immediately, and the destination PEs wait the appropriate number of clock cycles (determined by the difference between 'pe' and the respective PE ROW or COL number) before starting to receive the 16-bit-long bitstream.

The maximum time for the broadcast instructions depends on the array size. If the array size is N*N, the time is (N-2)+16 cycles. For a typical size of 16*16 PEs, the time required is 30 cycles. When the complete broadcasted word has been collected in the respective PE, it raises its PE_READY signal, informing the global CU that the instruction has been completed in this PE.

The broadcast instructions utilize a "bit-level pipelined" approach to distribute data amongst rows and columns. Using only short nearest-neighbour connections, scalability is maintained when porting the architecture to future down-scaled CMOS deep-submicron IC processes [2].

2.2.2 The Multiply-and-ACcumulate instructions

The Multiply-and-Accumulate instructions implement MAC operations either locally (MAC) or for each row (RMAC) or each column (CMAC). The RMAC and CMAC instructions use the same "bit-level pipelined" approach as the broadcast instructions. The principle is the same as described in Figure 5, with the addition that the bits are produced by the multipliers and accumulated along the way by the ALUs. Figure 6 illustrates this scheme.

[Figure omitted: a row of PEs with a bit-serial data flow moving east; the RMAC instruction starts in column 1 and ends in column N.]

Figure 6. Illustration of the RMAC instruction.

The time required for these instructions depends on the array size. If the array size is N*N, the time required for an RMAC or CMAC operation is N-1+log2(N)+32 cycles. For an array size of 16*16 PEs, this yields 51 clock cycles. For a 32*32 array, 68 cycles are required.
The maximum time for the broadcast instructions de-
pends on the array size. If the array size is N*N, the time
is (N-2)+16 cycles. For a typical size of 16*16 PEs, the
3 Basic matrix primitives

This section presents mappings and performance of some selected basic matrix operations on the architecture.

3.1 Multiplying a vector with the transpose of a matrix

In some algorithms (e.g. artificial neural networks, ANN), it is necessary to, after first doing the ordinary matrix-by-vector multiplication (R = WX), also multiply the result vector R with the transpose of the same matrix (W^T). (EQ 1) illustrates these calculations:

$r_i = \sum_j x_j w_{ij}$, $s_i = \sum_j r_j w_{ij}^T$ (EQ 1)

These operations are easily executed by the array using the RMAC and CMAC instructions, as shown in the following example. The ordinary matrix-vector multiplication (R = WX) is first calculated (assuming the X vector resides in the top row) using the RMAC instruction, after which the result vector R resides in the eastern PE column #N-1. The R vector is then broadcasted on the rows, and the CMAC instruction is used to create the result vector S.

CBROADC (0,PS)      /* broadcast the X vector on the columns */
RMAC                /* perform R = WX */
                    /* the accumulator is moved to the PS register */
RBROADC (N-1,PS)    /* broadcast the R vector from column #N-1 */
CMAC                /* MAC across the columns */

The S vector now resides in the southern (highest-numbered) row and may be moved to the eastern border, if needed, using only two further instructions (CBROADC (N-1), RBROADC_DIAG). Normally, however, as is the case in the ANN training algorithm "back-propagation", this operation is one part of a long sequence, and the result is not output but is used further in the calculations.
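For reference, the computation that the RMAC and CMAC steps carry out corresponds to the following sequential C model (ours, not array code):

#include <stdio.h>

#define N 3

/* Sequential reference model of EQ 1: r = W x, then s = W^T r.
   RMAC accumulates along rows (r); CMAC along columns (s). */
static void wx_then_wtr(const double w[N][N], const double x[N],
                        double r[N], double s[N])
{
    for (int i = 0; i < N; i++) {          /* r_i = sum_j x_j * w_ij   (RMAC) */
        r[i] = 0.0;
        for (int j = 0; j < N; j++)
            r[i] += x[j] * w[i][j];
    }
    for (int i = 0; i < N; i++) {          /* s_i = sum_j r_j * w_ji   (CMAC) */
        s[i] = 0.0;
        for (int j = 0; j < N; j++)
            s[i] += r[j] * w[j][i];
    }
}

int main(void)
{
    double w[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double x[N] = {1, 0, -1}, r[N], s[N];
    wx_then_wtr(w, x, r, s);
    for (int i = 0; i < N; i++)
        printf("r[%d] = %g, s[%d] = %g\n", i, r[i], i, s[i]);
    return 0;
}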
3.2 Vector transpose

Transposing a vector is straightforward and uses only two instructions (assuming the vector resides in the top row). First, the vector is broadcasted on the columns by the CBROADC instruction. Second, the RBROADC_DIAG instruction broadcasts the diagonal elements on the rows. The transposed vector now resides in the rightmost column. The instruction sequence for vector transpose is:

SOUTH (PS)          /* input the vector */
CBROADC (0,PS)      /* broadcast it on the columns */
RBROADC_DIAG (PS)   /* the diagonal PEs do a row broadcast */
EAST (PS)           /* output the vector */

3.3 Matrix addition and subtraction

Addition (or subtraction) is done in parallel using the ADD (SUB) instruction. If two matrices, A and B, are to be added, the instructions issued are (assuming that the A matrix elements reside in the ACCumulators):

LOAD_X (B)          /* load X register with address to B */
LOAD_PS             /* load B elements into PS registers */
ADD () (or SUB)     /* add/sub A and B items */
STORE (ACC_HIGH)    /* store result in B */
3.4 Matrix inversion

To invert a matrix A, the Gauss-Jordan elimination method may be used. This method is normally used to solve a linear system of equations. This section first describes how this solving can be done on the array. It is then shown how this can be extended to perform matrix inversion.

3.4.1 The Gauss-Jordan elimination method

This algorithm solves a system of linear equations with a variation of Gauss elimination. No back-substitution is necessary to complete the solving. All off-diagonal elements are eliminated. The resulting matrix and vector require only one division computation for each element to produce the solution vector. An example linear system of equations with three variables is shown below:

$$\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix} \quad \text{(EQ 2)}$$

The algorithm is as follows (showing the elimination of element A21):

• Calculate E = -A21/A11
• For J = 1 to 3, calculate A2J = A2J + A1J*E
• Calculate B2 = B2 + B1*E

This proceeds, one column at a time, until all off-diagonal elements have been eliminated. Finally, to obtain the solution vector:

• For i = 1 to 3, calculate Xi = Bi/Ai,i

The general algorithm (in its sequential form) can be expressed using the following pseudo-code:

FOR c IN 1 TO NO_COLS LOOP           /* for each column */
  FOR r IN 1 TO NO_ROWS LOOP         /* for each row */
    IF r=c CONTINUE;                 /* skip to next row if r=c */
    Er = -Arc/Acc;
    FOR j IN 1 TO NO_COLS LOOP       /* for each column in A */
      Arj = Arj + Acj*Er;
    ENDLOOP;
    Br = Br + Bc*Er;
  ENDLOOP;
ENDLOOP;
FOR r IN 1 TO NO_ROWS LOOP           /* for each row */
  Xr = Br/Ar,r
ENDLOOP

Parallelizing this sequential code yields the following pseudo-code:

FOR i IN 1 TO NO_COLS LOOP
  Broadcast row #i on the columns;
  Er = -Ari/Aii for each row r not equal to i; Ei = 0;
  Broadcast Er on the rows;
  Arc = Arc + Aic*Er;
  Br = Br + Bi*Er;
ENDLOOP

No row pivoting (row swapping) is used here, and thus cases in which a diagonal element is zero (Aii = 0) will produce a wrong result. Searching for the maximum element and row swapping may be added to cope with this situation.
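The sequential pseudo-code transcribes directly into C. The sketch below (ours) solves a small system; like the pseudo-code, it uses no pivoting, so a zero on the diagonal during elimination still breaks it.

#include <stdio.h>

#define N 3

/* Gauss-Jordan elimination without pivoting, as in the pseudo-code:
   eliminate all off-diagonal elements, then divide B by the diagonal. */
static void gauss_jordan(double a[N][N], double b[N], double x[N])
{
    for (int c = 0; c < N; c++) {            /* for each column */
        for (int r = 0; r < N; r++) {        /* for each row */
            if (r == c) continue;
            double e = -a[r][c] / a[c][c];   /* assumes a[c][c] != 0 */
            for (int j = 0; j < N; j++)
                a[r][j] += a[c][j] * e;
            b[r] += b[c] * e;
        }
    }
    for (int r = 0; r < N; r++)              /* the system is now diagonal */
        x[r] = b[r] / a[r][r];
}

int main(void)
{
    double a[N][N] = {{2,1,1},{1,3,2},{1,0,0}};
    double b[N] = {4, 5, 6}, x[N];
    gauss_jordan(a, b, x);
    printf("x = %g %g %g\n", x[0], x[1], x[2]);   /* solves A x = b */
    return 0;
}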
3.4.2 Executing Gauss-Jordan on the array

This section shows the code segment needed to execute the algorithm. The coefficient matrix A and result vector B are first loaded in the PE array by storing them in the PE local memories. The A matrix is stored in columns 0 to #NO_COLS-1, and the B vector is stored in the rightmost column (#NO_COLS). Also, a help matrix, H, is stored with an initial contents of:

$$H = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 1 & 1 & \cdots & 1 \end{bmatrix} \quad \text{(EQ 3)}$$

This help matrix is used to inhibit adjustment of row #i during the elimination process.

The elimination executed on the array is performed column by column. It starts from the leftmost column (column #0), eliminating all off-diagonal elements in this column, and proceeds towards the rightmost column. The instructions broadcast from the global CU are:

FOR i IN 0 TO NO_COLS-1 LOOP   /* for each column */
  Load_acc (A_B)
  CBROADC (ACC,i)         /* TEMPc <= Aic, or TEMPc <= Bi */
  Store_acc_hw (TEMP)
  Divide (A_B,TEMP)
  Store_acc_lw (E)
  Multiply (H,E)          /* clear all E values on row i */
  RBROADC (ACC,i)         /* broadcast E values on the rows */
  Store_acc_hw (E)        /* store Er <= Arc/TEMPc * Hrc */
  Multiply (TEMP,E)
  Store_acc_hw (SLASK)
  Load_acc (A_B)
  LOAD_X (SLASK)
  LOAD_PS
  SUB
  Store_acc_hw (A_B)      /* store Arc = Arc - TEMPc*Er, or Br = Br - TEMPc*Er */
  Load_acc (H)
  SOUTH (ACC16)           /* shift the help matrix down one PE step;
                             all "ones" are shifted in to the top row */
  Store_acc_hw (H)
ENDLOOP
Load_acc (A_B)
RBROADC_DIAG (ACC)        /* get the diagonal elements */
Store_acc_hw (SLASK)
Divide (A_B, SLASK)       /* the output vector is now in ACC (low word) in col. 3 */

The procedure (macro) "Multiply" above contains the following instructions:

Multiply (K,L):
  CLEAR
  LOAD_X (K)
  LOAD_Q
  LOAD_X (L)
  LOAD_PS
  MAC

The macros "Load_acc ()", "Divide ()", "Store_acc_hw ()" and "Store_acc_lw ()" load the ACC, divide Q by M, and store the ACC high word (bits 31-16, "hw") or the ACC low word (bits 15-0, "lw"), respectively.

Solving a 31*31 equation system (using a 32*32 array) requires 51040 cycles. At 100 MHz clock frequency this corresponds to 510 µs.
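One iteration of the CU loop can be modelled in C as follows (our sketch; TEMP and E follow the comments above, and the help matrix H is modelled by zeroing Er on the pivot row so that row i itself is not adjusted):

#include <stdio.h>

#define N 3                  /* A is N*N; column N of ab[] holds B */

/* One column step (pivot column c) of the array's Gauss-Jordan loop:
   TEMP <- row c broadcast down the columns, Er <- (Arc/TEMPc)*Hrc,
   then every element is adjusted: Arc <- Arc - TEMPc*Er. */
static void eliminate_column(double ab[N][N + 1], int c)
{
    double temp[N + 1], e[N];
    for (int j = 0; j <= N; j++)
        temp[j] = ab[c][j];               /* CBROADC of the pivot row */
    for (int r = 0; r < N; r++) {
        double h = (r == c) ? 0.0 : 1.0;  /* help matrix: zero on the pivot row */
        e[r] = ab[r][c] / temp[c] * h;    /* Divide + Multiply(H,E) + RBROADC */
    }
    for (int r = 0; r < N; r++)
        for (int j = 0; j <= N; j++)
            ab[r][j] -= temp[j] * e[r];   /* SUB: Arc = Arc - TEMPc*Er */
}

int main(void)
{
    double ab[N][N + 1] = {{2,1,1,4},{1,3,2,5},{1,0,0,6}};
    for (int c = 0; c < N; c++)
        eliminate_column(ab, c);
    for (int r = 0; r < N; r++)           /* final divides: x_r = B_r / A_rr */
        printf("x[%d] = %g\n", r, ab[r][N] / ab[r][r]);
    return 0;
}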
3.4.3 Matrix inversion using Gauss-Jordan elimination

To show that the Gauss-Jordan elimination method can perform matrix inversion, consider the following. First, use the relationship:

$$A A^{-1} = I \quad \text{(EQ 4)}$$

where I is the unity matrix. Then, use the notation:

$$A X = B \quad \text{(EQ 5)}$$

Identifying X and B in (EQ 4) yields:

$$X = A^{-1}, \quad B = I \quad \text{(EQ 6)}$$

Thus, by using Gauss-Jordan elimination with matrices A and B, where B is initialized as the unity matrix I, we can calculate the inverse matrix A^-1. As an example, to find the inverse of the 3*3 matrix A, we start with the following:

$$\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{(EQ 7)}$$

The A matrix is loaded in the first three columns of the array. The unity matrix is loaded in the last three columns. The elimination is now performed as in section 3.4.2, with the exception that three columns in the B matrix are handled instead of only one.

The number of clock cycles required to invert a 32*32 matrix using a 32*64 array is 53771 cycles. At 100 MHz clock frequency this corresponds to 538 µs. Mapping the algorithm this way means that the array is not quadratic. If this is not desirable, a quadratic array can be used if the B matrix is stored in the same PEs as the A matrix (in different memory positions). Of course, this slows down the algorithm execution, because the A and B elements can in this case not be calculated (adjusted) in parallel.
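In C, the same idea reads as follows (our sketch): run the elimination of section 3.4.1 on the augmented system [A | I], treating the N identity columns as N right-hand sides.

#include <stdio.h>

#define N 3

/* Invert A by Gauss-Jordan elimination on [A | I] (no pivoting). */
static void invert(const double a[N][N], double inv[N][N])
{
    double ab[N][2 * N];
    for (int r = 0; r < N; r++)
        for (int j = 0; j < N; j++) {
            ab[r][j]     = a[r][j];
            ab[r][N + j] = (r == j) ? 1.0 : 0.0;   /* B initialized to I */
        }
    for (int c = 0; c < N; c++)                    /* eliminate off-diagonals */
        for (int r = 0; r < N; r++) {
            if (r == c) continue;
            double e = -ab[r][c] / ab[c][c];
            for (int j = 0; j < 2 * N; j++)
                ab[r][j] += ab[c][j] * e;
        }
    for (int r = 0; r < N; r++)                    /* divide by the diagonal */
        for (int j = 0; j < N; j++)
            inv[r][j] = ab[r][N + j] / ab[r][r];
}

int main(void)
{
    const double a[N][N] = {{2,1,1},{1,3,2},{1,0,0}};
    double inv[N][N];
    invert(a, inv);
    for (int r = 0; r < N; r++)
        printf("%8.3f %8.3f %8.3f\n", inv[r][0], inv[r][1], inv[r][2]);
    return 0;
}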

3.5 Performing matrix-vector multiplication

Many signal processing algorithms can basically be formulated as a matrix-by-vector multiplication problem [3]. REMAP-γ can perform matrix/vector multiplication using three separate methods, each with different performance characteristics regarding throughput, latency and the amount of hardware needed. These are:

1. Using column broadcast and the RMAC instruction ("RMAC").
2. Using a systolic type of processing with the local MAC instruction and column broadcast ("pseudo-systolic 1").
3. Using a systolic type of processing with the local MAC instruction and nearest-neighbour PE communications only ("pseudo-systolic 2").

3.5.1 Matrix/vector multiplication using column broadcast and the RMAC instruction

The matrix/vector multiplication is basically performed by issuing two instructions, CBROADC (0) and RMAC (assuming the matrix is already loaded in the Q registers of the PEs). First, the vector is input to row #0. Second, this vector is broadcasted on the columns using the CBROADC (0) instruction. Third, the RMAC instruction performs the multiply-and-accumulate between the matrix and the vector elements. The result vector is found in the rightmost column. Figure 7 illustrates this with matrix A, input vector X and output vector O.

[Figure omitted: a 4*4 array holding elements a11...a44; the input vector x1...x4 is broadcast down the columns (CBROADC (0)) and the output vector o1...o4 is produced by RMAC in the rightmost column.]

Figure 7. Matrix/vector multiplication using column broadcast and RMAC.

In this method a new matrix-by-vector multiplication is not started until the previous one has completed. The main benefit is the very low latency (see Table 2) for a specific vector, measured as the time delay between input and output.

3.5.2 Matrix/vector multiplication using the local MAC instruction and column broadcast ("pseudo-systolic 1")

This method uses systolic-like processing where the input vectors are broadcasted on the columns and the accumulated sums are shifted east one PE step at a time through the array. The local MAC instruction produces the products and accumulates the sum. The result vector appears at the eastern output after some latency.

The procedure is as follows: the input vector is shifted down (SOUTH) one PE step (this also inputs the next vector at the same time). The CBROADC_DIAG instruction is then used to broadcast the vectors on the respective columns, using the diagonal PE elements as sources. Next, the MAC instruction produces the products and accumulates the sums. Finally, the accumulated sums are shifted one PE step to the east (which at the same time outputs a result vector).

Figure 8 shows this principle with matrix A and the input vector stream F, G, H, I, ...

In the systolic methods, where vectors are pipelined through the array, a new result is produced in each loop (after N initial loops). However, there is a "high" latency (equal to N loops) for a given vector.

[Figure omitted: a 3*3 array holding a11...a33; successive vectors f, g, h move down the columns while partial sums such as a11·f1 + a12·f2 + a13·f3 accumulate as they move east.]

Figure 8. "Pseudo-systolic 1" matrix/vector multiplication.
3.5.3 Using systolic processing and skew/deskew external registers ("pseudo-systolic 2")

This method uses systolic processing where the input vector is delayed (skewed) and the output vector is deskewed, according to Figure 10. The local MAC instruction is used to create the local product and add it to the accumulated sum shifted in from the left PE neighbour. The column broadcast is not necessary here, as it is in the first two methods.

This method yields the highest throughput but has a high latency (although not as high as "pseudo-systolic 1"). It requires extra hardware to skew/deskew the input and output vectors (shown in Figure 9). Each of these registers has a size equal to the PS (or Q) register in the PE datapath.

[Figure omitted: chains of delay registers (D) forming a skew section on the input vectors to the array and a deskew section on the output vectors from the array.]

Figure 9. Skew and deskew registers needed in the "pseudo-systolic 2" matrix/vector multiplication case.
Figure 10 shows how this matrix/vector multiplication is performed by the array. The matrix is stored in the PEs' Q registers, i.e. one matrix element in each PE. In each step, the vectors are shifted one step south, and the accumulated sums are shifted one step to the east. A result is produced in each step, and the resulting vector is found by deskewing the resulting vectors from the east edge PEs.

[Figure omitted: a 3*3 array with skewed input vectors entering from the north and skewed results leaving at the east edge, computing

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \cdot \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix} = \begin{bmatrix} a_{11} d_1 + a_{12} d_2 + a_{13} d_3 \\ a_{21} d_1 + a_{22} d_2 + a_{23} d_3 \\ a_{31} d_1 + a_{32} d_2 + a_{33} d_3 \end{bmatrix}$$

for each vector d in the input stream.]

Figure 10. "Pseudo-systolic 2" matrix/vector multiplication with skew/deskew of in- and out-data.
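The time-stepped C simulation below (ours, with illustrative sizes) mimics the "pseudo-systolic 2" data movement: column j of each input vector is delayed by j cycles (skew section), every PE does a local MAC while data moves south and sums move east, and row i of result k leaves the east edge at step k+i+N-1 (deskew section).

#include <stdio.h>

#define N 3            /* array size (N*N PEs) */
#define K 4            /* number of vectors pipelined through */
#define T (K + 2 * N)  /* enough steps to drain the pipeline */

int main(void)
{
    int a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};           /* one element per PE (Q reg) */
    int v[K][N] = {{1,0,0},{0,1,0},{0,0,1},{1,1,1}};   /* input vector stream */
    int d[N][N] = {0}, s[N][N] = {0};                  /* data moving south, sums moving east */
    int out[T][N];                                     /* raw (still skewed) east-edge output */

    for (int t = 0; t < T; t++) {
        for (int i = N - 1; i >= 0; i--)           /* update back-to-front so each PE */
            for (int j = N - 1; j >= 0; j--) {     /* reads its neighbours' old values */
                int din = (i > 0) ? d[i - 1][j]    /* skew: column j sees vector t-j */
                        : ((t - j >= 0 && t - j < K) ? v[t - j][j] : 0);
                int sin = (j > 0) ? s[i][j - 1] : 0;
                s[i][j] = sin + a[i][j] * din;     /* local MAC */
                d[i][j] = din;                     /* pass data south */
            }
        for (int i = 0; i < N; i++)
            out[t][i] = s[i][N - 1];
    }
    for (int k = 0; k < K; k++) {                  /* deskew the east-edge outputs */
        printf("A*v[%d] =", k);
        for (int i = 0; i < N; i++)
            printf(" %d", out[k + i + N - 1][i]);
        printf("\n");
    }
    return 0;
}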
Table 2 shows a comparison of the performance of the three matrix-vector multiplication methods in terms of sustained throughput and latency. The clock frequency assumed is 100 MHz and the array size is 32*32.

Table 2. Matrix/vector performance on a 32*32 array (@ 100 MHz)

Method              | Sustained throughput (GOPS) | Latency (clock cycles) | Latency (µs)
"RMAC"              | 1.4                         | 146                    | 1.46
"pseudo-systolic 1" | 1.6                         | 4032                   | 40
"pseudo-systolic 2" | 2.56                        | 2560                   | 25.6

As Table 2 reveals, the "pseudo-systolic 2" method is superior with respect to throughput (but requires extra skew and deskew registers), and the "RMAC" method is superior with respect to latency.
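The table's columns are mutually consistent: latency in µs is cycles divided by 100 MHz, and, counting 2*32*32 operations (one multiply and one add per matrix element) per product, the throughput figures imply the cycles per product recovered below. This arithmetic check is ours, not from the paper.

#include <stdio.h>

/* Consistency check of Table 2 on a 32*32 array at 100 MHz. */
int main(void)
{
    const double f_hz = 100e6, ops = 2.0 * 32 * 32;   /* ops per matrix-vector product */
    const char  *name[]    = { "RMAC", "pseudo-systolic 1", "pseudo-systolic 2" };
    const double gops[]    = { 1.4, 1.6, 2.56 };
    const double latency[] = { 146, 4032, 2560 };     /* clock cycles */

    for (int i = 0; i < 3; i++)
        printf("%-18s latency %7.2f us, ~%5.1f cycles/product sustained\n",
               name[i], latency[i] / f_hz * 1e6,
               ops / (gops[i] * 1e9) * f_hz);
    return 0;
}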
4 VLSI Test implementation

Two test prototype chips, one with 16 PEs (4*4) and one with 64 PEs (8*8), have been designed using VHDL synthesis and standard cells. The technology used was the ES2 0.7 micron N-well CMOS double-layer-metal process. The physical layout was created with Cadence place&route tools, which resulted in an array area of 225 mm2 (8*8 PEs) and a clock speed of 100 MHz.

The chip (4*4) block diagram is shown in Figure 11. As can be seen, the parallel-to-serial and serial-to-parallel conversion interfaces at the northern and eastern borders are included on-chip. These interfaces can, through the use of multiplexers, be bypassed on those PE chips that are not placed at the array borders when a multiple-chip array is constructed. The input interface (µPI_IN) includes four parallel-in/serial-out shift registers. The output interface (µPI_OUT) includes four serial-in/parallel-out shift registers.

[Figure omitted: the 4*4 test chip block diagram, with DATA_IN 1-4 feeding the µPI_IN (parallel-to-serial) interface, bypass multiplexers, the 4*4 PE grid (PE11...PE44), and the µPI_OUT (serial-to-parallel) interface driving DATA_OUT.]

Figure 11. Test chip (4*4) block diagram.

The 8*8 test chip has the same type of block diagram but has 64 PEs and eight input registers and eight output registers. Table 3 below summarizes the most important design parameters of the 8*8 test design.

Table 3. Summary of test chip (8*8) data

Chip parameter                              | Data
Technology                                  | 0.7 µm CMOS, double-layer metal
Clock frequency (MHz)                       | 100
Array area (mm2)                            | 225 (15 x 15)
Number of cells (excluding register memory) | 80450
Estimated power dissipation (W)             | < 12

Using scaling rules for CMOS [4] and [5], it is estimated that when using a state-of-the-art CMOS process (0.18 µm), approximately four times higher clock speed (400 MHz) and 16 (4^2) times smaller area should be expected. Thus, a 32*32 array would fit in one single chip.

5 Conclusions

This paper has shown the mapping and performance of basic matrix operations on a novel parallel DSP array architecture. These matrix operations, basic in many signal processing algorithms, include matrix inversion, matrix/vector multiplication, solving systems of linear equations, matrix addition/subtraction and matrix transposition. Performance figures were given and compared in terms of throughput, latency and execution times. Data for a VLSI test chip in 0.7 µm CMOS was presented, and estimations were shown regarding performance and chip size using a state-of-the-art process.
6 References

[1] J.H. Moreno, T. Lang, Matrix Computations on Systolic-Type Arrays, Kluwer Academic Publishers, ISBN 0-7923-9237-X, 1992.
[2] L. Bengtsson, "REMAP-γ: A Scalable SIMD VLSI Architecture with Hierarchical Control", PhD thesis no. 320, School of Electrical and Computer Engineering, Chalmers University of Technology, Gothenburg, Sweden, 1997.
[3] H.T. Kung, C.E. Leiserson, "Algorithms for VLSI Processor Arrays", in Introduction to VLSI Systems, Mead & Conway, Addison-Wesley, 1980.
[4] R.H. Dennard, F.H. Gaensslen, H.N. Yu, V.L. Rideout, E. Bassous, A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions", IEEE J. Solid-State Circuits, vol. SC-9, p. 256, 1974.
[5] K.C. Saraswat, F. Mohammadi, "Effect of Scaling of Interconnections on the Time Delay of VLSI Circuits", IEEE J. Solid-State Circuits, vol. SC-17, no. 2, pp. 275-280, April 1982.