
Programmable DSP Architectures: Part II

Edward A. Lee

This two-part paper explores the architectural features of single-chip programmable digital signal processors (DSPs) that make their impressive performance possible. Part I, which appeared in the previous issue of ASSP Magazine, discussed arithmetic and memory organizations. This part discusses pipelining. Three distinct techniques are used for dealing with pipelining: interlocking, time-stationary coding, and data-stationary coding. These techniques are studied in light of the performance benefit and the impact on the user. As in Part I, representative DSPs from AT&T, Motorola, and Texas Instruments are used to illustrate the ideas. It is not the intent of the author to catalog available DSPs nor their features, nor to endorse particular manufacturers. It is the intent to compare different solutions to the same problems. The paper concludes with a discussion of trends and some bold predictions for the future.

1. INTRODUCTION

In Part I of this paper, which appeared in the previous issue of ASSP Magazine, we found that programmable DSPs use multiple memory banks in order to get adequate memory bandwidth. Several variations on the basic Harvard architecture were described, but they all have one feature in common: an instruction is fetched at the same time that operands for a previously fetched instruction are being fetched. This can be viewed as a form of pipelining, where the instruction fetch, operand fetch, and instruction execution form a three-stage pipeline. Close examination, however, shows that most DSPs are not so simple. This paper examines the timing of instructions in DSPs, revealing intricacies and subtleties that easily evade the casual observer. In addition, trends are discussed, complete with predictions for the future.

Most of the examples used in this paper come from one of the DSPs in Table 1, reproduced from Part I. Other important DSPs are listed in Table 2 of Part I. Most of the architectural features of the DSPs in Table 2 of Part I are also represented in Table 1, so their explicit inclusion in this paper would be redundant. The choice of DSPs in Table 1 stems primarily from the familiarity of the author with the devices, and should not be construed as an endorsement. The reader is urged to contact the manufacturers for complete and up-to-date specifications, and not to rely on the data presented in this paper.

The views expressed in this paper are those of the author and do not reflect an endorsement or policy of the ASSP Society, the Publications Board or the ASSP Magazine editorial personnel.

2. PIPELINING

A typical programmable DSP has instructions that will fetch two operands from memory, multiply them, add them to an accumulator, write the result to memory, and post-increment three address registers. It is obvious that if all these operations had to be done sequentially within one instruction cycle, the instruction cycle times would be much longer than they are. Fast execution is accomplished using pipelining.

Pipelining effectively speeds up the computation, but it can have a serious impact on programmability. There are three fundamentally different techniques for dealing with pipelining in a programmable processor: interlocking, time-stationary coding, and data-stationary coding. The TI DSPs primarily use interlocking, the Motorola DSPs and the AT&T DSP16/16A use time-stationary coding, and the AT&T DSP32/32C use data-stationary coding. As with most taxonomies, the boundaries between the categories are not rigid, and most DSPs have some of the flavor of all three.

2.1. Interlocking.

One philosophy is that the programmer should not be bothered with the internal timing or parallelism of the architecture. Programs should be written in an assembly language in which the programmer can assume that every action specified in one instruction completes before the next instruction begins. Furthermore, each instruction should completely specify its operand memory locations and the operation performed. The processor may be pipelined, but it should not act as if it were so.

A simple model for the pipelining of a programmable processor divides the instruction execution into instruction fetch, decode, operand fetch, and execute stages, as shown in Figure 1. In the figure, the cross-hatched boxes indicate latches, which latch signals once per instruction cycle. The instruction fetch occurs at the same time that the previous instruction is being decoded, and at the same time that the operands for the instruction before that are being fetched. The trick is to overlap instructions in this way and still give the impression that every instruction finishes before the next instruction begins.
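To make the overlap concrete, the following C fragment (an illustrative sketch, not from the original paper) prints which stage of a four-stage pipeline each instruction occupies on each cycle. In steady state one instruction retires per cycle, even though each instruction takes four cycles to flow through the pipe.

    /* Sketch: trace a 4-stage pipeline (fetch, decode, operand, execute).
       Instruction i enters the pipe on cycle i and occupies stage
       (cycle - i) until it retires after four cycles. */
    #include <stdio.h>

    #define STAGES 4
    #define NINSTR 6

    int main(void)
    {
        const char *stage_name[STAGES] = { "fetch", "decode", "operand", "execute" };

        for (int cycle = 0; cycle < NINSTR + STAGES - 1; cycle++) {
            printf("cycle %2d:", cycle);
            for (int i = 0; i < NINSTR; i++) {
                int stage = cycle - i;
                if (stage >= 0 && stage < STAGES)
                    printf("  I%d:%s", i, stage_name[stage]);
            }
            printf("\n");
        }
        return 0;
    }

From cycle 3 onward, the trace shows one instruction in each of the four stages, which is why a new instruction can be fetched every cycle.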
Example 1

The TMS320C30 conforms well with the pipeline model of Figure 1. Consider the parallel multiply and add instruction (see Example 18 of Part I for a program using this instruction). Its timing is shown in Figure 2 using a reservation table. Hardware resources are listed on the left and time increases to the right. First the instruction is fetched. We assume internal memory is used, in which case only half an instruction cycle is required for the fetch, but time is available for an external access, which would require a full instruction cycle. Then two parallel address arithmetic units are used to compute the operand addresses. The TMS320C30 provides indexed addressing, in which an index must be added to the address prior to the fetch, so computing operand addresses is non-trivial. After this, the operands are fetched. They may be fetched from two different memories, as shown, or from the same memory. The DATA bus is used to transfer the operands to the arithmetic units. Finally, the multiply and add proceed in parallel, consuming a complete instruction cycle. A similar instruction can be fetched every cycle without any conflict for resources.

Although the execution of the instruction is scattered over four cycles, it is important that the programmer be unaware of this. A store instruction that follows the parallel multiply and add must be able to store either the result of the multiply or the add (or both) without any delay.
The earliest possible time that such a store could occur is in the fifth instruction cycle of Figure 2. In Figure 3, a store to RAM2 is shown occurring at that time. Also shown, cross-hatched, is a third instruction, fetched in the third instruction cycle, that reads from RAM2. In order to hide the pipelining from the programmer, it is essential that this instruction be able to read the data just stored. With the timing shown this occurs.

There are many possible variations on the instructions shown in Figure 2 and Figure 3. Suppose, for example, that the arithmetic instruction in Figure 3 required two operands from RAM2. If the immediately preceding instruction (the store) were not using RAM2, there would be no problem, but as it is, there would be contention for RAM2 in the fifth instruction cycle of Figure 3. In this event, the control hardware will delay the execution of the arithmetic instruction. This delay is called interlocking. In fact, in the TMS320C30, contention for resources is not uncommon, and the control hardware delays the execution of instructions in order to resolve it. The programmer need not be aware of this, but obviously performance will be degraded. Interestingly, TI supplies a simulator that gives the detailed timing of any sequence of instructions, so that programmers intent on optimizing their code can do so easily.

Of the DSPs discussed here, the ones that make most use of interlocking are the TI processors. The internal timing of these devices is quite elaborate, and varies depending on whether internal or external memories are being used, and sometimes changes in complicated ways during the execution of a program. Nevertheless, we can gain intuition by considering a few more examples.

Example 2

Careful examination of the TMS32010/20/C25 architectures suggests that the ADD instruction is executed as shown in Figure 4a. The ADD instruction adds an operand from data memory to the accumulator. The instruction cycle is divided into four subcycles, and internal memory accesses are completed in three subcycles. A total of two instruction cycles is required, although it is clear from the reservation table that ADD instructions can be executed at the rate of one per instruction cycle without any resource conflict. Furthermore, an ADD instruction can use the result (stored in the accumulator register) of the immediately preceding ADD instruction. So there is no pipeline hazard.

Notice that the actual addition takes only half an instruction cycle. This must be so because the ADD instruction can be followed by a SACH instruction that stores the result of the ADD (in the accumulator) to memory. The timing of the SACH instruction is shown in Figure 4b. The accumulator must be valid at the time marked in Figure 4b if the SACH instruction is to work properly. The time marked is precisely the end of the addition in Figure 4a. Furthermore, the SACH instruction may be followed by an ADD that uses the data value just stored to memory; this might be foolish, but it is certainly permissible. This determines that the write must be completed no later than shown in Figure 4b, so that the read of a following ADD instruction reads valid data. For this sequence of instructions (arithmetic, store, arithmetic) to work without evident pipelining, it is necessary that a write, the arithmetic, and a read complete within two instruction cycles.

The previous examples illustrate two important concepts. First, the execution of an instruction need not be constrained to one instruction cycle in order to appear constrained to one instruction cycle. Second, the internal timing of the DSP architecture can be inferred by carefully considering the requirements of different sequences of instructions.
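The following C sketch models the kind of interlock just described. The two-cycle read offset, three-cycle write offset, and bank layout are invented for illustration and do not correspond to the actual TMS320C30 timing; the point is only that "control hardware" can delay issue until no two in-flight instructions claim the same memory bank in the same cycle.

    /* Sketch: issue-time interlocking with invented latencies.  Each
       instruction reads a memory bank 2 cycles after issue and writes
       one 3 cycles after issue; issue is delayed on any bank conflict. */
    #include <stdio.h>

    #define MAXCYCLES 64
    #define NBANKS    4

    static int bank_busy[MAXCYCLES][NBANKS];  /* bank reserved this cycle? */

    /* Return the earliest cycle >= earliest at which the instruction can
       issue without a memory-bank conflict, reserving its bank accesses. */
    static int issue(int earliest, int rd_bank, int wr_bank)
    {
        for (int t = earliest; t < MAXCYCLES - 3; t++) {
            if (!bank_busy[t + 2][rd_bank] && !bank_busy[t + 3][wr_bank]) {
                bank_busy[t + 2][rd_bank] = 1;
                bank_busy[t + 3][wr_bank] = 1;
                return t;
            }
        }
        return -1;   /* does not happen for this toy trace */
    }

    int main(void)
    {
        /* Three instructions; the last two both need bank 2, so the third
           cannot issue back-to-back and a stall is inserted. */
        int rd[3] = { 0, 2, 2 }, wr[3] = { 1, 2, 2 };
        int next = 0;

        for (int i = 0; i < 3; i++) {
            int t = issue(next, rd[i], wr[i]);
            printf("instruction %d issues at cycle %d%s\n",
                   i, t, t > next ? "  (interlock stall)" : "");
            next = t + 1;   /* nominally one issue per cycle */
        }
        return 0;
    }

The third instruction issues one cycle late; the program that produced the instructions never has to know, which is precisely the interlocking philosophy.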
Example 3

It is instructive to consider the FIR filter code for the TMS32010, reproduced from Example 15 of Part I:

LARK AR0,address of last coefficient
LARK AR1,address of last data word
LARP 0
LT   *-,AR1
MPY  *-,AR0
LTD  *-,AR1
MPY  *-,AR0
LTD  *-,AR1
MPY  *-,AR0
APAC
ADD  ONE,14
SACH RESULT,1

We will later compare the timing of this implementation to the faster and more compact code using the RPTK and MACD instructions. The heart of the code is the alternating LTD and MPY instructions. The LT instruction loads the T register with a value from memory (see Figure 1 of Part I). The LTD instruction does the same thing, but in addition, the value loaded into the T register is copied into the memory location above where it came from (to implement a delay-line shift) and the product register is added to the accumulator. One possible timing for the LTD instruction is shown in Figure 5a. The addition could actually occur earlier, since it does not depend on the operand fetched, but control hardware is probably simpler if it is positioned as shown because of its similarity to the ADD instruction in Figure 4a. One possible instruction timing for the MPY is shown in Figure 5b. The LTD and MPY instructions can alternate as in Example 3 without conflict for resources, and all required data is available on time. Because of the late start of the addition in Figure 5a, the multiplication in Figure 5b has a full instruction cycle to complete its operation. Notice that execution of this instruction actually spills into a third instruction cycle.

Example 4

The TMS32020 and TMS320C25 have a more compact construct for FIR filtering using the RPTK and MACD instructions:

RPTK constant
MACD m1,m2

One possible timing of the first MACD instruction is shown in Figure 6. In this case, both the multiplication and addition could begin earlier, but as shown their timing coincides with that of the MPY and ADD instructions, so the control hardware is probably simpler this way. As shown, the instruction consumes two instruction cycles before the next instruction can be fetched. If the instruction is fetched from the unit length instruction cache, however, then the doubly cross-hatched instruction fetch in Figure 6 is not required and only one instruction cycle is consumed. This is how these architectures achieve FIR filtering in one instruction cycle per tap.
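For reference, here is a plain C rendering (an illustrative sketch, not TI code; the coefficient values are arbitrary) of what the LTD/MPY sequence, or equivalently RPTK/MACD, computes: each tap is accumulated into the sum while the sample is simultaneously shifted up the delay line, just as LTD copies the loaded value to the next memory location.

    /* Sketch: N-tap FIR filter with the LTD-style delay-line shift. */
    #include <stdio.h>

    #define N 3

    static int coeff[N] = { 10, 20, 30 };   /* h[0], h[1], h[2] */
    static int delay[N];                    /* x[n], x[n-1], x[n-2] */

    /* Process one input sample and return one output sample. */
    static long fir_step(int x_in)
    {
        long acc = 0;
        delay[0] = x_in;
        /* Walk from the oldest tap down, as the TMS32010 code does. */
        for (int i = N - 1; i > 0; i--) {
            acc += (long)coeff[i] * delay[i]; /* MPY + accumulate (APAC) */
            delay[i] = delay[i - 1];          /* the LTD delay-line shift */
        }
        acc += (long)coeff[0] * delay[0];
        return acc;
    }

    int main(void)
    {
        int input[5] = { 1, 0, 0, 0, 0 };   /* impulse: output is h[n] */
        for (int n = 0; n < 5; n++)
            printf("y[%d] = %ld\n", n, fir_step(input[n]));
        return 0;
    }

Driving the filter with an impulse prints 10, 20, 30, 0, 0, confirming that the shift-as-you-read discipline implements the delay line correctly.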

2.2. Time-Stationary Coding.


Although clearly beneficial for the programmer, interlocking has its costs. Higher performance can often be obtained by giving the programmer more explicit control over the pipeline stages. The most common way to do this is using time-stationary coding, in which an instruction specifies the operations that occur simultaneously in one instruction cycle.

Several DSPs are built around the rough outline of a reservation table shown in Figure 7. An instruction would explicitly specify three (or more) operations to be performed in parallel, two memory fetches and one (or more) arithmetic operations. Referring back to Figure 1, each instruction specifies simultaneous operand fetch and execute operations, rather than successive operand fetch and execute operations. In essence, the program model is one of parallelism rather than pipelining.
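In C terms, the time-stationary program model looks something like the following sketch (illustrative only, not any manufacturer's syntax): each loop body stands for one instruction whose arithmetic field uses the operand registers loaded by the previous instruction, while its move fields fetch the operands for the next.

    /* Sketch: a time-stationary multiply-accumulate loop.  The product
       uses the old operand registers; the moves load the next operands. */
    #include <stdio.h>

    #define N 4

    int main(void)
    {
        int x_mem[N] = { 1, 2, 3, 4 };
        int y_mem[N] = { 5, 6, 7, 8 };
        long a = 0;                 /* accumulator */
        int x0, y0;                 /* operand registers */

        /* Prologue: one instruction just to prime the operand registers. */
        x0 = x_mem[0];
        y0 = y_mem[0];

        for (int i = 1; i <= N; i++) {
            /* One "instruction": all fields execute in the same cycle. */
            long prod = (long)x0 * y0;  /* arithmetic field: old x0, y0  */
            if (i < N) {                /* move fields: fetch next pair  */
                x0 = x_mem[i];
                y0 = y_mem[i];
            }
            a += prod;
        }
        printf("dot product = %ld\n", a);   /* 1*5+2*6+3*7+4*8 = 70 */
        return 0;
    }

The one-instruction prologue is the telltale artifact of this style: the programmer, not the hardware, keeps the operand fetch one step ahead of the arithmetic.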
Example 5

A multiply and accumulate instruction for the DSP56001 or 96002 is:

MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0

There are three fields in this instruction, one specifying the arithmetic operation, and the other two specifying operand fetches for the next instruction. The operands of the arithmetic operation are the contents of the X0 and Y0 registers, which were loaded from memory in a previous instruction.

The result of the multiplication is added to the A register. Unlike any other DSP, the DSP56001 has integrated an adder into the multiplier, so that multiplication and accumulation are not two successive operations but actually occur together in the same hardware. The MAC instruction shown here multiplies the contents of X0 and Y0, simultaneously adding the result to A, so that in the next cycle, A has been completely updated. This is possible because multiplier hardware can be easily modified so that as it multiplies it also adds a number to the product. Unfortunately, this is more difficult to accomplish with floating point arithmetic, so in the DSP96002 floating point multiplication and addition can be specified separately using instructions like:

FMPY D4,D5,D0 FADD D0,D1 X:(R0)+,D4 Y:(R4)+,D5

The contents of D4 and D5 are multiplied and stored in D0. Meanwhile, the previous contents of D0 (not the result of the FMPY) are added to D1. In addition, the two data moves occur simultaneously, affecting the values of the D4 and D5 registers for subsequent instructions. In effect, the programmer explicitly fashions the pipeline by specifying the activity in each stage of the pipeline.

Compare this instruction with the 320C30 parallel multiply and add, MPYF3 || ADDF3 (see Example 18 of Part I). In the 320C30, the operands are fully specified in the instruction that uses them. In the 96002, the operands are specified (memory addresses given) in an instruction preceding the arithmetic instruction. Nonetheless, by permitting parallel instructions, the 320C30 has introduced an element of time-stationary coding.

Figure 7. An outline of a reservation table used by several DSPs that use time-stationary coding.

Example 6

The NEC 77230 is similar to the DSP96002 in that multiply, add, and move instructions are specified in one instruction. Interestingly, the assembler syntax for such instructions attempts to mimic that of processors with hidden pipelining. An example of a parallel multiply and add instruction is:

MOV LKR0,ROM
ADDF WR0,M
INCBP0
INCRP;

Here, four fields can be specified on four separate lines, and a semicolon groups the fields. The first line specifies a move of two operands from memory (ROM and RAM) into the L and K registers. Meanwhile, the current contents (before the move) of the L and K registers are multiplied. No mnemonic is given for the multiplication because it occurs automatically in every cycle regardless of whether its result is used. Meanwhile, the product register M (from multiplication in the previous instruction) is added to the working register WR0, which acts as an accumulator. The last two lines specify the auto-increment for the pointers to memory (ROM and RAM) that are used in the first line. The four fields can also be specified together on one line.

Example 7

The AT&T DSP16 and DSP16A use a reservation table outlined in Figure 8 for instructions with two operands from memory. A typical instruction is:

a0=a0+p p=x*y y=*r0++ x=*pt++

The product register p from the previous multiplication is added to a0 at the same time that a new product is formed using the contents of the x and y registers. Meanwhile, the x and y registers are loaded with new values using the address registers r0 and pt. This processor only has a demand ratio of two, so two-operand instructions consume two instruction cycles, as shown in Figure 8. However, if the instruction is fetched from the instruction cache, then the doubly cross-hatched operation (the instruction fetch) is not required, and only one cycle is consumed. Again, unlike the DSP56001, the multiplication and addition are specified as separate parallel operations.

Time-stationary coding has a number of advantages. First, the timing of a program is clearer. With interlocking, it is difficult to determine exactly how many cycles an instruction will consume because it depends on the neighboring instructions. Second, interrupts can be much more efficient. Since the programmer has explicit control over the pipeline, there is no need to flush the pipe prior to invoking the interrupt. One consequence of this is the possibility of very fast interrupts.
Example 8

The DSP56001 and 96002 have a fast interrupt, which takes exactly two instruction cycles. It can be used to grab input data and put it in a buffer, for example.

2.3. Data-Stationary Coding.

Time-stationary coding resembles microcode in that fields of the instruction specify operations in different parts of the architecture. While it is easy to grow accustomed to it, it is more natural to think of our algorithms in a data-stationary way. In data-stationary coding, a single instruction specifies all of the operations performed on a set of operands from memory. In other words, the instruction specifies what happens to that data, rather than specifying what happens at a particular time in the hardware. A major consequence is that the results of the instruction may not be immediately available in the subsequent instructions.

Example 9

The most dramatic examples of data-stationary coding are the AT&T DSP32 and 32C. A typical instruction is:

*r5++ = a1 = a0 + *r7 * *r10++r17

The address registers r7 and r10 specify the two operands for the multiplier. The register r17 specifies the post-auto-increment for r10. The product is added to a0 and the result stored in a1 and in the memory location specified by r5. The instruction is easy to read and understand, but it should be obvious that all these operations cannot be finished within one instruction cycle. The timing of the instruction, shown in Figure 9, actually covers six instruction cycles. An instruction of this type can be issued once per instruction cycle, so only a single cycle is consumed. Instructions like this one can be used to implement an FIR filter in one instruction cycle per tap. In fact, careful examination of the reservation table reveals that the hardware resources listed are 100 percent utilized by such a program if the number of filter taps is large enough.

Suppose the result is to be read from memory in a subsequent instruction and used as an operand. An instruction accomplishing this read would have to be fetched in the fourth cycle after the multiply and add instruction. In other words, the three instructions fetched immediately after the multiply and add instruction cannot read the result from memory because it has not yet been written to memory when they fetch their operands. The pertinent restrictions, evident in Figure 9, are summarized as follows:

- When an accumulator aN is used as an operand to the multiplier, the value of the accumulator is that established three instructions earlier.
- When a result is written to memory, the updated value of the memory location cannot be accessed until four instructions later.

Although results are not ready in the next instruction, data-stationary coding is no less efficient than time-stationary coding. In time-stationary coding, to specify a multiply, accumulate, and store operation on a pair of operands requires several instructions, and the total time to completion is the same as with data-stationary coding, assuming a similar hardware organization. In time-stationary coding, other operations proceed in parallel, specified in unused fields of the multiply and accumulate instructions. In data-stationary coding, other operations proceed in parallel specified by neighboring instructions.
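The delayed-validity hazard can be made concrete with a small C model (a sketch; the DSP32's actual pipeline registers are not shown, and the four-instruction memory latency is omitted for brevity). Nine back-to-back "a0 = a0 + 1" instructions are simulated with the three-instruction accumulator latency; the visible accumulator advances only every third instruction.

    /* Sketch: data-stationary hazard.  A result becomes readable three
       instructions after the instruction that produced it. */
    #include <stdio.h>

    #define ACC_LATENCY 3

    int main(void)
    {
        long pipe[ACC_LATENCY] = { 0, 0, 0 };   /* in-flight a0 values */

        for (int n = 0; n < 9; n++) {
            long seen = pipe[0];             /* what "a0" reads back now   */
            long updated = seen + 1;         /* this instruction: a0=a0+1  */
            for (int k = 0; k + 1 < ACC_LATENCY; k++)
                pipe[k] = pipe[k + 1];       /* advance pipeline registers */
            pipe[ACC_LATENCY - 1] = updated; /* lands 3 instructions later */
            printf("instruction %d sees a0 = %ld\n", n, seen);
        }
        return 0;
    }

The trace reads 0, 0, 0, 1, 1, 1, 2, 2, 2: a naive dependent sequence wastes two of every three instructions, which is why independent work must be found to fill the intervening slots.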
Fast interrupts are more difficult with data-stationary than with time-stationary coding, but are nonetheless possible.

Example 10

The DSP32C has a three-cycle quick interrupt. To accomplish this, the chip designers inserted a second set of pipeline registers that "shadow" the main set, storing the processor state when an interrupt occurs. Roughly 400 bits are stored. Nested interrupts are not possible, of course, since there is only one set of shadow registers.

2.4. Branching.

One difficulty with pipelining that we have thus far ignored concerns branching, particularly conditional branching. Several problems conspire to make it difficult to achieve efficient branching:

- There may not be sufficient time between instruction fetches to decode a branch instruction before the next instruction is fetched.
- If the program address space is large, the destination address may not fit in an instruction word, so a second fetch from the instruction memory may be required. Alternatives are paging and PC-relative addressing.
- In the case of conditional branching, the fetch of the next instruction cannot occur before the condition codes in the ALU can be tested.

Example 11

In the TMS32010/20/C25, in order to hide the pipelining from the programmer, branch instructions require several cycles to execute, where the exact number depends on the system configuration. In the case of an unconditional branch, the extra cycle is needed to fetch a destination address from the program memory. In the case of a conditional branch, the ALU condition codes can be tested while the fetch of the destination address proceeds.

Example 12

In the DSP16, unconditional branches consume two cycles and conditional branches consume three.

Example 13

In the DSP32 and DSP32C, when any control group instruction (if, call, return, goto) is executed, the instruction immediately following is also executed before the branch occurs. This is called a delayed branch. For conditional branches based on the result of a data arithmetic (DA) operation, the condition tested will be established by the last DA instruction four instructions prior to the test. It is evident from Figure 9 that the conditions on the adder cannot be tested in time to affect any instruction earlier than four instructions later.

Example 14

The TMS320C30 has both delayed and multi-cycle branches, so the programmer can choose.

Because of the inefficiencies of multi-cycle and delayed branches, it is important to use the low-overhead looping capability of the processors for tight inner loops, rather than using branch instructions.

Example 15

The 56001 and 96001 have the best developed low-overhead looping capability. Any number of instructions may be included inside a loop, loops are interruptible, and loops may be nested. The assembler syntax is straightforward:

     DO 10,END
     loop body
END

Another technique used to avoid the inefficiencies of conditional branches is conditional instructions. These are operations other than branches that are conditionally executed. For example, a conditional write instruction can often be used to avoid a conditional branch.
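As a concrete illustration of the conditional-instruction idea, the following C loop (a generic sketch, not code for any particular DSP) finds the maximum of a block of samples using a conditional write rather than a data-dependent branch, so a pipelined machine never has to flush on a taken branch.

    /* Sketch: replace a conditional branch with a conditional write. */
    #include <stdio.h>

    int main(void)
    {
        int x[8] = { 3, -1, 7, 2, 9, 0, 4, 9 };
        int max = x[0];

        for (int i = 1; i < 8; i++) {
            /* A conditional-write DSP can issue this as one instruction
               with no branch, leaving the pipeline undisturbed. */
            max = (x[i] > max) ? x[i] : max;
        }
        printf("max = %d\n", max);
        return 0;
    }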
3. THE FUTURE

As with all microelectronics, programmable DSPs have evolved considerably in the last ten years. It is easy to extrapolate the current trends and predict processors with more memory, faster MAC times, more I/O flexibility and bandwidth, etc. But such VLSI-driven improvement is by no means the only visible trend.

3.1. The Market.

The market for programmable DSPs remains limited to specialized products with relatively low volume (with the exception of modems and consumer products like the "Julie" doll by Worlds of Wonder). However, this is likely to change dramatically in the near future. Programmable DSPs are likely to become standard peripherals in personal computers and workstations. The standard microprocessor used now will continue to handle operating system tasks and interactive applications, but the DSP will handle real-time and compute-intensive tasks. In principle, the same board with one (or a few) programmable DSPs can be used as a modem, a general purpose number-cruncher, a graphics processor, a speech and music synthesizer, a speech recognizer, a music analyzer, a digital audio processor, and a telephone message processor, including voice store-and-forward. Such a product would obviously enhance the capabilities of today's workstations and PCs, and would broaden the market for DSPs.

3.2. Parallelism.

Many applications have such stringent real-time constraints that multiple DSPs must be used in concert. Surprisingly, very little thought or effort has historically been put into designing DSPs for parallel computation. There are few features, in hardware or software, to ease the task of synchronizing processors or accessing shared resources. Fortunately, this is changing. For example, several newer processors have controllable wait-states for external memory accesses. This is invaluable for access to shared memory where the access may have to be delayed due to contention. In addition, most DSPs have extra pins that can be tested in software. These pins can be used to synchronize multiple processors. The TMS320C30 has specialized instructions for doing this; TI calls it a hardware interlock capability. Motorola facilitates the design of multiprocessor systems with the dual expansion ports in the DSP96002.

All of these are small steps, however. An essential capability that is almost totally lacking is software simulators capable of simulating multiple-DSP systems. System designers must build first, test later. A notable exception is Motorola, which supplies a simulator in the form of subroutines, which can be called from user-written code. Each call to such a subroutine emulates the state change of a processor in one clock cycle. A system designer planning to use more than one DSP56001 can write a C program that emulates the interconnection of the DSPs, shared memory, busses, and whatever other hardware is used (assuming the designer is willing to write emulation code for this other hardware). At Berkeley, we are integrating Motorola's callable simulator into a general-purpose hardware simulator from Stanford called Thor [Tho86] in order to get a clean user interface for designing and simulating parallel DSP systems.
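The structure of such a user-written harness might look like the C sketch below. The names step_dsp and dsp_state are hypothetical stand-ins invented for illustration; Motorola's actual subroutine interface is not documented here, and the "state change" performed per clock is a placeholder.

    /* Sketch: a cycle-stepping harness for two emulated DSPs sharing
       memory.  step_dsp stands in for one call to a vendor-supplied
       simulator subroutine that advances one processor by one clock. */
    #include <stdio.h>

    typedef struct { int pc; long acc; } dsp_state;   /* toy state */

    static void step_dsp(dsp_state *d, int *shared_mem)
    {
        d->acc += shared_mem[d->pc % 4];   /* placeholder state change */
        d->pc++;
    }

    int main(void)
    {
        dsp_state dsp[2] = { { 0, 0 }, { 2, 0 } };
        int shared_mem[4] = { 1, 2, 3, 4 };           /* emulated RAM */

        for (int clock = 0; clock < 8; clock++) {
            /* Advance every processor, then any other emulated hardware. */
            step_dsp(&dsp[0], shared_mem);
            step_dsp(&dsp[1], shared_mem);
            /* ... emulate busses, arbitration, I/O here ... */
        }
        printf("dsp0.acc = %ld, dsp1.acc = %ld\n", dsp[0].acc, dsp[1].acc);
        return 0;
    }

Because the whole system advances in lockstep, one clock at a time, contention for the shared memory can be modeled and debugged before any hardware is built.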
A more radical approach to parallel DSPs has been proposed by NEC with the introduction of the µPD7281, a data flow machine for image processing. This chip may be simply ahead of its time, since it has not achieved wide acceptance.

3.3. Software.

One of the main impediments to widespread use of DSPs is that they remain difficult to use compared with other microprocessors. Products take years to develop, and programs take months to write even though the final code can often be stored in less than 1K words of program ROM. There are several reasons for this difficulty:

- Although the performance is impressive, today's DSPs are barely fast enough for many real-time applications. Programs must be tuned by hand to meet speed constraints.
- On-chip memories are small, and expansion beyond the chip boundaries is practically limited to only one (sometimes two) of the memory banks. Furthermore, off-chip memories that do not slow down the processor must be fast, and hence are expensive. Programs must be hand tuned to avoid squandering memory.
- Compounding the above problems, DSPs often compete with custom circuits in fiercely competitive marketplaces, such as in voiceband data modems. Programs must be hand tuned to minimize the overall hardware requirements of the systems.

One possible solution to the above problems is a good optimizing compiler. Some C compilers have appeared for some DSPs, but so far they do not appear to generate efficient enough code to meet the above constraints. Optimizing C compilers for the forthcoming generation of floating-point DSPs look promising, however [Har88] [Sim88]. Regardless of their efficiency, the C compilers will inevitably be used for large applications of the DSPs, which are not practical to code by hand, such as graphics.

For digital signal processing, it is doubtful that good C compilers alone are the complete solution. Higher level design environments are being constructed to permit rapid prototyping (for algorithm development) and efficient code generation (for deployment in a competitive marketplace). The most promising systems under development are based on block-diagram programming, in which the user graphically constructs a block diagram of the algorithm. The user can use standard blocks from a supplied library, or define new blocks, possibly even in C. Burr-Brown has already demonstrated a preliminary code generator for the AT&T DSP32 that begins with a high-level block-diagram description of the algorithm. It uses the same interface as their DSPlay signal processing simulator. NEC is also known to be developing a system for the 77230. At Berkeley, we are developing a block-diagram programming environment called Gabriel that systematically manages real-time constraints, changes in sample rates, recurrences (feedback), conditionals, and iteration, and is capable of generating code for multiple processors [Lee87b]. A Gabriel screen is shown in Figure 10. This system is intended to be retargettable, and we have demonstrated its basic capabilities for the DSP56001 and DSP32.

Block-diagrams have two important advantages. First, they are a natural description of many DSP algorithms. Second, they can potentially be automatically partitioned for execution on parallel processors [Lee87a]. The user need not know the details of the architecture, or even the number or type of DSPs. Block-diagram languages fit the data-flow model of computation, about which considerable theory has been developed. Generalizing these techniques to get the full expressive power of a programming language, however, is still a challenging research area.
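A minimal C sketch of the underlying execution model may help: blocks communicate through FIFOs and fire when enough input samples are present. The fifo type and the source/gain/sink chain below are invented for illustration; Gabriel's actual interfaces are not shown in this paper.

    /* Sketch: a statically scheduled source -> gain -> sink dataflow
       chain, the execution model behind block-diagram programming. */
    #include <stdio.h>

    #define FIFO_LEN 16
    typedef struct { int buf[FIFO_LEN]; int head, tail; } fifo;

    static void put(fifo *f, int v)  { f->buf[f->tail++ % FIFO_LEN] = v; }
    static int  get(fifo *f)         { return f->buf[f->head++ % FIFO_LEN]; }
    static int  count(const fifo *f) { return f->tail - f->head; }

    int main(void)
    {
        fifo a = { {0}, 0, 0 }, b = { {0}, 0, 0 };

        /* A compile-time (here, fixed) schedule fires each block once
           per pass; a block fires only when its input FIFO has data. */
        for (int n = 0; n < 4; n++) {
            put(&a, n);                        /* source: emit a sample  */
            if (count(&a) >= 1)
                put(&b, 3 * get(&a));          /* gain: one-in, one-out  */
            if (count(&b) >= 1)
                printf("sink: %d\n", get(&b)); /* sink: consume a sample */
        }
        return 0;
    }

Because the firing rates are known statically, a compiler can schedule the blocks at compile time and even assign them to different processors, which is the partitioning advantage noted above.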

3.4. Simpler Processors.

The dominant trend in DSPs is towards complexity, not simplicity. Every new device has features that the previous ones lacked, such as floating point, DMA, vectored interrupts, bit-reversed addressing, zero-overhead looping, and more extensive I/O. With all these features, DSPs are starting to tread on the turf of microprocessors. Unfortunately, this trend ignores the market that spurred the development of DSPs in the first place, which required arithmetic performance near the limits of what current technology could supply. A market exists for simple and fast DSPs. Although many manufacturers appear to be moving away from this market, some are embracing it. For example, the Hitachi DSPi and AT&T DSP16A are high speed chips with more limited functionality than the current generation of floating-point DSPs.

3.5. Semi-Custom Processors.

Many DSPs can be purchased in two versions, one with program RAM, and the other with mask-programmed program ROM. A typical development uses the first version for code development and migrates to the second version when ready for production. A DSP with mask-programmed ROM can be considered an application-specific IC.

Of course, the contents of the program memory may not be the only feature of the DSP that the user wishes to customize. It would be useful, for example, to customize the sizes of the memories. VLSI real-estate could be freed for this purpose by eliminating parts of the DSP that are not used. Possibilities include:

- Trim the arithmetic word width to what is actually needed.
- Remove the multiplier for low-speed applications, or applications that make little use of it, and replace with shift-and-add code (see the sketch following this list).
- Remove I/O facilities when they are not used, such as DMA controllers.
- Customize barrel shifters to only perform the shifts actually used in the program.
- Customize the size of the register file.
- Eliminate bit-reversed or indexed addressing, if it is not used.
- Customize or eliminate the instruction cache, depending on whether it is required to meet real-time constraints.
- Customize the size of the address space, and hence the width of registers and busses and the number of pins.
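As a sketch of the shift-and-add replacement mentioned in the list above (illustrative C; the fixed coefficient 10 in the first routine is an arbitrary example), multiplication is decomposed into shifts and adds, the kind of code a multiplier-less semi-custom DSP would run.

    /* Sketch: multiplication by shifts and adds, no multiplier needed. */
    #include <stdio.h>

    /* Fixed coefficient: 10*x = (x << 3) + (x << 1), since 10 = 8 + 2. */
    static int mul10(int x)
    {
        return (x << 3) + (x << 1);
    }

    /* General shift-and-add multiply for a variable non-negative factor. */
    static long mul_shift_add(long a, unsigned b)
    {
        long p = 0;
        while (b) {
            if (b & 1)
                p += a;   /* add the shifted multiplicand per set bit */
            a <<= 1;
            b >>= 1;
        }
        return p;
    }

    int main(void)
    {
        printf("10 * 7 = %d\n", mul10(7));                 /* 70  */
        printf("13 * 11 = %ld\n", mul_shift_add(13, 11));  /* 143 */
        return 0;
    }

For fixed filter coefficients the compiler can emit the first, fully unrolled form, trading a large multiplier array for a few adders.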
A user would develop the application using a high-level description such as C, a block diagram language, or some other language, and given a real-time constraint, a compiler would automatically determine the required architecture parameters.

Automated layout programs have been demonstrated that are capable, in principle, of generating layouts that are parametrized in these ways. For example, a system called Lager that has many of these capabilities is under development at Berkeley [Pop85].

The idea of customizing an existing architecture has its limitations. An alternative approach is to automatically synthesize an architecture well suited in every way to the application. This approach appears to be most promising for applications with extremely high performance requirements and relatively low complexity, such as video-rate algorithms. The Cathedral project at KUL (Katholieke Universiteit Leuven) is an example of a research effort aimed in this direction [Cat88].

Although automatic layout has improved dramatically in recent years, there are still many difficult problems that remain to be solved before these techniques are fully practical. But the progress thus far is encouraging.

3.6. Pipeline Interleaving.

As discussed above, pipelining introduces a special set of difficulties that are either borne by the architecture designer or by the programmer. Three techniques for dealing with pipelining are described above: hiding it, time-stationary coding, and data-stationary coding. Hiding the pipelining completely requires compromising performance. Time-stationary coding resembles microcode and can be difficult to generate (for either humans or compilers). Data-stationary coding has artifacts (called hazards) such as delayed data validity that again make code difficult to write. A fourth solution that has not yet been implemented in any commercial DSP is pipeline interleaving [Lee87c].

The idea is an old one, dating back to the 1960s. Consider the following strategy for writing code on a processor such as the AT&T DSP32C which uses data-stationary coding and extensive pipelining. Instead of writing a single in-line instruction stream, alternate instructions from three or four reasonably independent instruction streams, as illustrated in Figure 11. In other words, begin by identifying operations that can proceed in parallel. Then partition the register set among these applications, and write code for each application, ignoring pipeline hazards. Then interleave the code so that pipeline hazards become irrelevant because sufficient time passes between any two instructions in one stream for results to be valid. The DSP32 conveniently provides a relatively large number of address registers (15) and accumulator registers (4) so that such partitioning is viable.
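Continuing the toy model from the data-stationary example above (illustrative assumptions: a three-instruction result latency and three streams), the C sketch below shows why round-robin interleaving removes the hazard: by the time a stream issues its next instruction, its previous result has reached the end of the pipe.

    /* Sketch: pipeline interleaving.  With result latency 3 and three
       round-robin streams, each stream always sees its own last result. */
    #include <stdio.h>

    #define LATENCY 3            /* result visible 3 instructions later */
    #define STREAMS 3            /* one logical stream per (added) PC   */

    int main(void)
    {
        long pipe[LATENCY] = { 0, 0, 0 };   /* in-flight values */

        for (int n = 0; n < 9; n++) {
            int s = n % STREAMS;            /* round-robin issue */
            long seen = pipe[0];            /* established LATENCY ago:
                                               stream s's own result   */
            long updated = seen + 1;        /* stream s: acc = acc + 1 */
            for (int k = 0; k + 1 < LATENCY; k++)
                pipe[k] = pipe[k + 1];
            pipe[LATENCY - 1] = updated;
            printf("cycle %d: stream %d sees acc = %ld\n", n, s, seen);
        }
        return 0;
    }

Each stream now counts 0, 1, 2 with no wasted cycles, in contrast to the single-stream trace shown earlier, where the same latency stalled progress for two of every three instructions.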
The main advantage of the above strategy is that the programmer can ignore the pipeline, but the architecture does not suffer the compromises that result from hiding the pipelining. However, a serious problem remains. Suppose that one of the three or four interleaved instruction streams requires a branch. Unfortunately, there is only one program counter in the DSP32, so all instruction streams must branch together. A simple solution is to introduce multiple PCs, one for each instruction stream. This technique is called pipeline interleaving.

Pipeline interleaving transforms a single DSP with data-stationary code and pipeline hazards into multiple processors (called processor slices) that have no pipeline hazards and actually share the same hardware, except registers. A pipeline interleaved architecture that can be built with conservative technology is described in detail in [Lee87c].

Although pipeline interleaving removes pipeline hazards, it introduces new problems. The algorithm must be partitioned for parallel computation. One proposal is to use the data flow properties of block-diagram languages, as described in [Lee87d], to automatically (at compile time) partition and synchronize the task.

4. ACKNOWLEDGEMENTS

The author gratefully acknowledges the careful reading and thoughtful comments of Jim Boddie, Craig Garen, and John Hartung from AT&T, Philip Goldworth and T. J. Shan of Fujitsu, Kenji Kaneko from Hitachi, Bryant Wilder, Kevin Kloker, and Garth Hillman from Motorola, Takao Nishitani of NEC, and Panos Papamichalis and Ray Simar from Texas Instruments. Other helpful suggestions were made by Bob Owen. Most importantly, the editor in chief of the ASSP Magazine, Tom Alexander, took a great interest in this paper and was extremely helpful. Any remaining errors are entirely the fault of the author.

REFERENCES

[Cat88] F. Catthoor, J. Rabaey, G. Goossens, J. L. Van Meerbergen, R. Jain, H. J. De Man, and J. Vandewalle, "Architectural Strategies for an Application-Specific Synchronous Multiprocessor Environment," IEEE Trans. ASSP, February 1988, 36(2).

[Har88] J. Hartung, S. L. Gay, and S. G. Haigh, "A Practical C Language Compiler/Optimizer for Real-Time Implementation on a Family of Floating Point DSPs," Proceedings of ICASSP, pp. 1674-1677, New York, April 1988.

[Lee87a] E. A. Lee and D. G. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Trans. on Computers, January 1987, C-36(2).

[Lee87b] E. A. Lee and D. G. Messerschmitt, "Synchronous Data Flow," Proceedings of the IEEE, September 1987.

[Lee87c] E. A. Lee and D. G. Messerschmitt, "Pipeline Interleaved Programmable DSPs: Architecture," IEEE Trans. on ASSP, September 1987, ASSP-35(9).

[Lee87d] E. A. Lee and D. G. Messerschmitt, "Pipeline Interleaved Programmable DSPs: Synchronous Data Flow Programming," IEEE Trans. on ASSP, September 1987, ASSP-35(9).

[Pop85] S. Pope, J. Rabaey, and R. W. Brodersen, "An Integrated Automatic Layout Generation System for DSP Circuits," IEEE Trans. on Computer-Aided Design, July 1985, CAD-4(3), pp. 285-296.

[Sim88] R. Simar Jr. and A. Davis, "The Application of High-Level Language to Single-Chip Digital Signal Processors," Proceedings of ICASSP, pp. 1678-1681, New York, April 1988.

[Tho86] VLSI/CAD Group, "Thor Tutorial," Stanford University, Stanford, CA, 1986.

Edward A. Lee has been an assistant professor in the Electrical Engineering and Computer Science Department at U.C. Berkeley since July, 1986. His research activities include parallel computation, architecture and software techniques for programmable DSPs, design environments for real-time software development, and digital communication. He has taught short courses on the architecture of programmable DSPs and telecommunications applications of programmable DSPs. He was a recipient of the 1987 NSF Presidential Young Investigator award, an IBM faculty development award, and the 1986 Sakrison prize at U.C. Berkeley for the best thesis in Electrical Engineering. He is co-author of "Digital Communication", with D. G. Messerschmitt, Kluwer Academic Press, 1988. His B.S. degree is from Yale University (1979), his masters (S.M.) from MIT (1981), and his PhD from U.C. Berkeley (1986). From 1979 to 1982 he was a member of technical staff at Bell Labs in Holmdel, New Jersey, in the Advanced Data Communications Laboratory, where he did extensive work with early programmable DSPs, and exploratory work in voiceband data modem techniques and simultaneous voice and data transmission.
