nishant · 08-16-2017, 10:20 PM

ABSTRACT
The TigerSHARC processor is the newest and most powerful member of
this family, incorporating mechanisms such as SIMD, VLIW and short
vector memory access in a single processor. This is the first time
that all of these have been combined in a real-time processor.

The TigerSHARC DSP is an ultra-high-performance static superscalar
architecture that is optimized for telecommunications infrastructure
and other computationally demanding applications.

The unique architecture combines elements of RISC, VLIW and standard
DSP processors to provide native support for 8-, 16- and 32-bit
fixed-point as well as floating-point data types on a single chip.
Large on-chip memory, extremely high internal and external bandwidths
and dual compute blocks provide the capabilities needed to handle a
vast array of computationally demanding, large signal processing
tasks.
Contents
1. INTRODUCTION
1.1. Analog and digital signals
1.2. Signal processing
1.3. Digital signal processing
1.4. Development of DSP
1.5. Digital signal processors
2. ARCHITECTURE OF DIGITAL SIGNAL PROCESSORS
2.1. Von Neumann Architecture
2.2. Harvard Architecture
2.3. Super Harvard Architecture
3. THE TIGER SHARC PROCESSOR
3.1. Features
3.2. Benefits
3.3. Description
3.4. Tiger SHARC Processor families
3.5. Functional Block Diagram
3.5.1. Architectural Features
3.5.2. Adapts to evolving signal processing demands
3.5.3. Multiprocessor, general-purpose processing
3.5.4. Instruction Parallelism and SIMD operations
3.5.5. Independent, Parallel Computation Blocks
3.5.6. CLU (Communications Logic Unit)
3.5.7. Integer ALUs
3.5.8. Tiger SHARC memory Integration
3.5.9. Program Sequencer
3.5.10. Flexible Integrated Memory
3.5.11. DMA Controller
3.5.12. Link Ports
3.5.13. External Port
4. APPLICATIONS
5. ADVANTAGES
6. CONCLUSION
7. REFERENCES
1. INTRODUCTION
1.1 Analog and digital signals
In many cases, the signal of interest is initially in the form of

an analog electrical voltage or current, produced for example by a

microphone or some other type of transducer. An analog signal must

be converted into digital form before DSP techniques can be

applied. An analog electrical voltage signal, for example, can be

digitized using an electronic circuit called an analog-to-digital

converter or ADC. This generates a digital output as a stream of

binary numbers whose values represent the electrical voltage input

to the device at each sampling instant.
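
The conversion described above can be sketched in a few lines of Python. This is only an illustrative model of an ideal converter: the function names, the 1 V reference, the 8-bit resolution and the 8 kHz sample rate are assumptions chosen for the example, not values from the text.

```python
import math

def adc_sample(voltage, v_ref=1.0, bits=8):
    """Quantize a voltage in [0, v_ref) to an unsigned integer code."""
    levels = 1 << bits
    code = int(voltage / v_ref * levels)
    return max(0, min(levels - 1, code))  # clip out-of-range inputs

def digitize(signal, sample_rate, n_samples, **adc_kwargs):
    """Sample a continuous-time function signal(t) at uniform instants."""
    return [adc_sample(signal(n / sample_rate), **adc_kwargs)
            for n in range(n_samples)]

# Hypothetical input: a 1 kHz sine centred at mid-scale, sampled at 8 kHz.
sine = lambda t: 0.5 + 0.4 * math.sin(2 * math.pi * 1000 * t)
codes = digitize(sine, sample_rate=8000, n_samples=8)
```

Each entry of `codes` is the binary number the converter would emit at one sampling instant.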
1.2 Signal processing
Signals commonly need to be processed in a variety of ways. For

example, the output signal from a transducer may well be

contaminated with unwanted electrical "noise". The electrodes

attached to a patient's chest when an ECG is taken measure tiny

electrical voltage changes due to the activity of the heart and

other muscles. The signal is often strongly affected by "mains

pickup" due to electrical interference from the mains supply.

Processing the signal using a filter circuit can remove or at least

reduce the unwanted part of the signal. Increasingly nowadays, the

filtering of signals to improve signal quality or to extract

important information is done by DSP techniques rather than by

analog electronics.
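
As a rough illustration of digital filtering, the sketch below removes mains pickup with a moving average whose length equals one mains period, so each window sums a full cycle of the interference to zero. The 1000 Hz sample rate, the 50 Hz mains frequency and the synthetic signals are assumptions made for the example; a practical ECG filter would be designed more carefully.

```python
import math

def moving_average(samples, length):
    """Average each sample with the (length - 1) samples before it."""
    out = []
    for n in range(length - 1, len(samples)):
        window = samples[n - length + 1 : n + 1]
        out.append(sum(window) / length)
    return out

FS = 1000               # assumed sample rate, Hz
MAINS = 50              # assumed mains frequency, Hz
PERIOD = FS // MAINS    # 20 samples per mains cycle

# Slow 1 Hz "wanted" signal plus strong 50 Hz pickup (both synthetic).
wanted = [math.sin(2 * math.pi * 1 * n / FS) for n in range(1000)]
pickup = [0.5 * math.sin(2 * math.pi * MAINS * n / FS) for n in range(1000)]
noisy = [w + p for w, p in zip(wanted, pickup)]

cleaned = moving_average(noisy, PERIOD)
```

The wanted signal passes through almost unchanged apart from a small delay, while the mains component is cancelled.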
1.3 Digital Signal Processing
Digital signal processing (DSP) is the study of signals in a

digital representation and the processing methods of these signals.

DSP and analog signal processing are subfields of signal processing.
Digital Signal Processing is carried out by mathematical operations.
In comparison, word processing and

similar programs merely rearrange stored data. This means that

computers designed for business and other general applications are

not optimized for algorithms such as digital filtering and Fourier

analysis. Digital Signal Processors are microprocessors

specifically designed to handle Digital Signal Processing tasks.

These devices have seen tremendous growth in the last decade,

finding use in everything from cellular telephones to advanced

scientific instruments. In fact, hardware engineers use "DSP" to

mean Digital Signal Processor, just as algorithm developers use

"DSP" to mean Digital Signal Processing.
1.4 Development of DSP
The development of digital signal processing dates from the 1960's

with the use of mainframe digital computers for number-crunching

applications such as the Fast Fourier Transform (FFT), which allows

the frequency spectrum of a signal to be computed rapidly. These

techniques were not widely used at that time, because suitable

computing equipment was generally available only in universities

and other scientific research institutions.
1.5 Digital Signal Processors (DSPs)
DSP processors are microprocessors designed to perform digital

signal processing: the mathematical manipulation of digitally

represented signals. The introduction of the microprocessor in the

late 1970's and early 1980's made it possible for DSP techniques to

be used in a much wider range of applications. However, general-

purpose microprocessors such as the Intel x86 family are not

ideally suited to the numerically-intensive requirements of DSP,

and during the 1980's the increasing importance of DSP led several

major electronics manufacturers
(such as Texas Instruments, Analog Devices and Motorola) to develop

Digital Signal Processor chips - specialised microprocessors with

architectures designed specifically for the types of operations

required in digital signal processing. (Note that the acronym DSP

can variously mean Digital Signal Processing, the term used for a

wide range of techniques for processing signals digitally, or

Digital Signal Processor, a specialised type of microprocessor

chip). Like a general-purpose microprocessor, a DSP is a

programmable device, with its own native instruction code. DSP

chips are capable of carrying out millions of floating point

operations per second, and like their better-known general-purpose

cousins, faster and more powerful versions are continually being

introduced. DSPs can also be embedded within complex "system-on-

chip" devices, often containing both analog and digital circuitry.
Advantages over other microprocessors:
Single-cycle multiply-accumulate (MAC) operations
Real time performance
Flexibility and Reliability
Increased system performance
Reduced cost
Harvard architecture
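
The multiply-accumulate operation named in the list above is the inner step of most DSP algorithms. A minimal sketch, assuming an FIR filter as the example workload (the function name and the sample values are illustrative only):

```python
def fir_output(samples, coeffs):
    """One FIR output: sum of products of recent samples and coefficients.

    samples[-1] is the newest input; coeffs[0] multiplies it.
    """
    acc = 0
    for k, c in enumerate(coeffs):
        acc += c * samples[-1 - k]   # one multiply-accumulate (MAC) per tap
    return acc

y = fir_output([1, 2, 3, 4], [0.5, 0.25, 0.25])
```

A DSP that completes one MAC per cycle runs this loop in as many cycles as there are filter taps, which is why the operation is singled out.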
2. Architecture of the Digital Signal Processor
One of the biggest bottlenecks in executing DSP algorithms is

transferring information to and from memory. This includes data,

such as samples from the input signal and the filter coefficients,

as well as program instructions, the binary codes that go into the

program sequencer. For example, suppose we need to multiply two

numbers that reside somewhere in memory. To do this, we must fetch

three binary values from memory, the numbers to be multiplied, plus

the program instruction describing what to do.
2.1 Von Neumann architecture
Figure 1(a) shows how this seemingly simple task is done in a

traditional microprocessor. This is often called a Von Neumann

architecture, after the brilliant American mathematician John Von

Neumann (1903-1957). Von Neumann guided the mathematics of many

important discoveries of the early twentieth century. His many

achievements include: developing the concept of a stored program

computer, formalizing the mathematics of quantum mechanics, and

work on the atomic bomb.
As shown in (a), a Von Neumann architecture contains a single

memory and a single bus for transferring data into and out of the

central processing unit (CPU). Multiplying two numbers requires at

least three clock cycles, one to transfer each of the three numbers

over the bus from the memory to the CPU. We don't count the time to

transfer the result back to memory, because we assume that it

remains in the CPU for additional manipulation (such as the sum of

products in an FIR filter). The Von Neumann design is quite

satisfactory when you are content to execute all of the required

tasks in serial. In fact, most computers today are of the Von

Neumann
design. When an instruction is processed in such a processor, units

of the processor not involved at each instruction phase wait idly

until control is passed on to them. Increase in processor speed is

achieved by making the individual units operate faster, but there

is a limit on how fast they can be made to operate. So we need

other architectures when very fast processing is required, and we

are willing to pay the price of increased complexity.
2.2 Harvard architecture
This leads us to the Harvard architecture, shown in (b). This is

named for the work done at Harvard University in the 1940s under

the leadership of Howard Aiken (1900-1973). As shown in this

illustration, Aiken insisted on separate memories for data and

program instructions, with separate buses for each. Since the buses

operate independently, program instructions and data can be fetched

at the same time, improving the speed over the single bus design.

Most present day DSPs use this dual bus architecture.
2.3 Super Harvard Architecture(SHARC)
Figure 1(c) illustrates the next level of sophistication, the Super

Harvard Architecture. This term was coined by Analog Devices to

describe the internal operation of their ADSP-2106x and new ADSP-

211xx families of Digital Signal Processors. These are called

SHARC DSPs, a contraction of the longer term, Super Harvard

ARChitecture. The idea is to build upon the Harvard architecture by

adding features to improve the throughput. While the SHARC DSPs are

optimized in dozens of ways, two areas are important enough to be

included in Fig. 1(c): an instruction cache, and an I/O controller.
A handicap of the basic Harvard design is that the data memory bus

is busier than the program memory bus. When two numbers are

multiplied, two binary values (the numbers) must be passed over the

data memory bus, while only one binary value (the program

instruction) is passed over the program memory bus. To improve upon

this situation, we start by relocating part of the "data" to

program memory. For instance, we might place the filter

coefficients in program memory, while keeping the input signal in

data memory. (This relocated data is called "secondary data" in the

illustration). At first glance, this doesn't seem to help the

situation; now we must transfer one value over the data memory bus

(the input signal sample), but two values over the program memory

bus (the program instruction and the coefficient). In fact, if we

were executing random instructions, this situation would be no

better at all.
However, DSP algorithms generally spend most of their execution

time in loops. This means that the same set of program instructions

will continually pass from program memory to the CPU. The Super

Harvard architecture takes advantage of this situation by including

an instruction cache in the CPU. This is a small memory that

contains about 32 of the most recent program instructions. The

first time through a loop, the program instructions must be passed

over the program memory bus. This results in slower operation

because of the conflict with the coefficients that must also be

fetched along this path. However, on additional executions of the

loop, the program instructions can be pulled from the instruction

cache. This means that all of the memory to CPU information

transfers can be accomplished in a single cycle: the sample from

the input signal comes over the data memory bus, the coefficient

comes over the program memory bus, and the program instruction

comes from the instruction cache. In the jargon of the field, this

efficient transfer of data is called a high memory-access

bandwidth.
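
The bus traffic described for the three architectures can be compared with a toy model. It assumes that each bus moves one value per cycle, that separate buses work in parallel, and that the busiest bus sets the cycle count; these are simplifications for illustration, and the function name is hypothetical.

```python
def cycles_per_multiply(architecture, instruction_cached=False):
    """Bus cycles to fetch one instruction, one sample and one coefficient."""
    if architecture == "von_neumann":
        buses = {"shared": 3}                      # everything on one bus
    elif architecture == "harvard":
        buses = {"program": 1, "data": 2}          # instruction | sample + coefficient
    elif architecture == "super_harvard":
        instr = 0 if instruction_cached else 1     # a cache hit needs no bus transfer
        buses = {"program": instr + 1, "data": 1}  # (instruction +) coefficient | sample
    else:
        raise ValueError(f"unknown architecture: {architecture}")
    return max(buses.values())
```

Under these assumptions the model gives three cycles for Von Neumann, two for Harvard, two for the first pass through a Super Harvard loop, and one for every later pass once the instruction is cached.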
Figure 2 presents a more detailed view of the SHARC architecture,

showing the I/O controller connected to data memory. This is how

the signals enter and exit the system. For instance, the SHARC DSPs

provide both serial and parallel communications ports. These are

extremely high speed connections. For example, at a 40 MHz clock

speed, there are two serial ports that operate at 40 Mbits/second

each, while six parallel ports each provide a 40 Mbytes/second data

transfer. When all six parallel ports are used together, the data

transfer rate is an incredible 240 Mbytes/second.
Just as important, dedicated hardware allows these data streams to

be transferred directly into memory (Direct Memory Access, or DMA),

without having to pass through the CPU's registers. The main buses

(program memory bus and data memory bus) are also accessible from

outside the chip, providing an additional interface to off-chip

memory and peripherals. This allows the SHARC DSPs to use a four

Gigaword (16 Gbyte) memory, accessible at 40 Mwords/second (160

Mbytes/second), for 32 bit data.
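
The I/O and memory figures quoted above follow from simple arithmetic, reproduced here as a check. The variable names are illustrative, and the external rate assumes one 32-bit word per clock cycle, which matches the quoted 40 Mwords/second at 40 MHz.

```python
CLOCK_HZ = 40e6             # 40 MHz clock, as quoted in the text
BYTES_PER_WORD = 4          # 32-bit data

parallel_port_rate = 40e6                     # bytes/second per parallel port
total_parallel = 6 * parallel_port_rate       # six ports used together

address_space_words = 4 * 2**30               # four Gigawords
address_space_bytes = address_space_words * BYTES_PER_WORD

external_rate_bytes = CLOCK_HZ * BYTES_PER_WORD   # one word per clock cycle
```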

This type of high speed I/O is a key characteristic of DSPs. The

overriding goal is to move the data in, perform the math, and move

the data out before the next sample is available. Everything else

is secondary. Some DSPs have on-board analog-to-digital and

digital-to-analog converters, a feature called mixed signal.

However, all DSPs can interface with external converters through

serial or parallel ports.
At the top of the diagram are two blocks labeled Data Address

Generator (DAG), one for each of the two memories. These control

the addresses sent to the program and data memories, specifying

where the information is to be read from or written to. In simpler

microprocessors this task is handled as an inherent part of the

program sequencer, and is quite transparent to the programmer.

However, DSPs are designed to operate with circular buffers, and

benefit from the extra hardware to manage them efficiently. This

avoids needing to use precious CPU clock cycles to keep track of

how the data are stored. For instance, in the SHARC DSPs, each of

the two DAGs can control eight circular buffers. This means that

each DAG holds 32 variables (4 per buffer), plus the required

logic.
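
A software model helps show what the address generator does for each circular buffer. The four fields below (base, length, index, step) are illustrative names for the four per-buffer variables mentioned above; they are an assumption for this sketch and are not taken from a SHARC manual.

```python
class CircularBuffer:
    """Software model of one hardware-managed circular buffer."""

    def __init__(self, memory, base, length, step=1):
        self.memory = memory    # shared backing store (a list)
        self.base = base        # address of the first element
        self.length = length    # number of elements in the buffer
        self.index = base       # current address
        self.step = step        # address increment after each access

    def _advance(self):
        offset = (self.index - self.base + self.step) % self.length
        self.index = self.base + offset      # wrap around automatically

    def write(self, value):
        self.memory[self.index] = value
        self._advance()

    def read(self):
        value = self.memory[self.index]
        self._advance()
        return value

memory = [0] * 8
buf = CircularBuffer(memory, base=2, length=4)
for v in range(1, 7):       # six writes into a four-element buffer
    buf.write(v)
```

On a DSP the wraparound in `_advance` is done by the address generator hardware, so the program never spends cycles on it.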
Some DSP algorithms are best carried out in stages. For instance,

IIR filters are more stable if implemented as a cascade of biquads

(a stage containing two poles and up to two zeros). Multiple stages

require multiple circular buffers for the fastest operation. The

DAGs in the SHARC DSPs are also designed to efficiently carry out

the Fast Fourier transform. In this mode, the DAGs are configured

to generate bit-reversed addresses into the circular buffers, a

necessary part of the FFT algorithm. In addition, an abundance of

circular buffers greatly simplifies DSP code generation- both for

the human programmer as well as high-level language compilers, such

as C.
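
Bit-reversed addressing can be illustrated in software. The sketch below computes the access order that a radix-2 FFT needs; the function names are hypothetical, and on the DSP the address generator produces these addresses directly.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(data):
    """Reorder a power-of-two-length list into bit-reversed index order."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or 1 << bits != n:
        raise ValueError("length must be a power of two")
    return [data[bit_reverse(i, bits)] for i in range(n)]
```

For eight points, index 1 (binary 001) maps to 4 (binary 100), index 3 (011) maps to 6 (110), and so on.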
The data register section of the CPU is used in the same way as in

traditional microprocessors. In the ADSP-2106x SHARC DSPs, there

are 16 general purpose registers of 40 bits each. These can hold

intermediate calculations, prepare data for the math processor,

serve as a buffer for data transfer, hold flags for program

control, and so on. If needed, these registers can also be used to

control loops and counters; however, the SHARC DSPs have extra

hardware registers to carry out many of these functions.
The math processing is broken into three sections, a multiplier,

an arithmetic logic unit (ALU), and a barrel shifter. The

multiplier takes the values from two registers, multiplies them,

and places the result into another register. The ALU performs

addition, subtraction, absolute value, logical operations (AND, OR,

XOR, NOT), conversion between fixed and floating point formats, and

similar functions. Elementary binary operations are carried out by

the barrel shifter, such as shifting, rotating, extracting and

depositing segments, and so on. A powerful feature of the SHARC

family is that the multiplier and the ALU can be accessed in

parallel. In a single clock cycle, data from registers 0-7 can be

passed to the multiplier, data from registers 8-15 can be passed to

the ALU, and the two results returned to any of the 16 registers.
There are also many important features of the SHARC family

architecture that aren't shown in this simplified illustration. For

instance, an 80 bit accumulator is built into the multiplier to

reduce the round-off error associated with multiple fixed-point

math operations. Another interesting feature is the use of shadow

registers for all the CPU's key registers. These are duplicate

registers that can be switched with their counterparts in a single

clock cycle. They are used for fast context switching, the ability

to handle interrupts quickly. When an interrupt occurs in

traditional microprocessors, all the internal data must be saved

before the interrupt can be handled. This usually involves pushing

all of the occupied registers onto the stack, one at a time. In

comparison, an interrupt in the SHARC family is handled by moving

the internal data into the shadow registers in a single clock

cycle. When the interrupt routine is completed, the registers are

just as quickly restored. This feature allows step 4 on our list

(managing the sample-ready interrupt) to be handled very quickly

and efficiently.

The SHARC has a 32/40-bit floating-point and fixed-point core, with a
DMA controller and dual-ported SRAM that move data into and out of
memory without wasting core cycles. It has high-performance
computation units and four buses, which let it fetch the next
instruction, access two data values, and perform DMA for an I/O
device at the same time.
3. The TigerSHARC Processor
TigerSHARC processors provide the highest performance density for
multiprocessing applications, with peak performance well above a
billion floating-point operations per second. One Gbyte/sec link
ports gluelessly connect multiple TigerSHARC processors, and versions
are available with up to 24 Mbits of integrated, on-chip memory.
Keeping pace with the accelerating march of architectural innovation
in DSPs, Analog Devices (ADI) unveiled its third-generation
floating-point DSP, TigerSHARC.
At the unveiling, architect Jose Fridman described a complex,
high-performance VLIW-based design incorporating unusually extensive
single-instruction, multiple-data (SIMD) capabilities. Unlike its
predecessors, which are primarily aimed at applications demanding
floating-point arithmetic, TigerSHARC has excellent fixed-point
capabilities and is better described as a 16-bit fixed-point DSP with
floating-point support than as a floating-point DSP.
The TigerSHARC Processor provides leading-edge system performance

while keeping the highest possible flexibility in software and

hardware development.
The TigerSHARC Processor's balanced architecture utilizes

characteristics of RISC, VLIW, and DSP to provide a flexible, "all

software" approach that adds capacity while reducing costs and

bills of material.
3.1 FEATURES
Static superscalar architecture
Two 32-bit MACs per cycle with 80-bit accumulation
Eight 16-bit MACs per cycle with 40-bit accumulation
Two 16-bit complex MACs per cycle
Add-subtract instruction and bit reversal in hardware for FFTs
64-bit generalised bit manipulation unit
Two billion MACs per second at 250 MHz
2 billion 16-bit MACs
500 million 32-bit MACs
12 GB/s of internal memory bandwidth for data and code
500 MHz, 2.0 ns instruction cycle rate
12 Mbits of internal on-chip DRAM memory
Dual computation blocks, each containing an ALU, a multiplier, a shifter and a register file
Dual integer ALUs, providing data addressing and pointer manipulation
Single-precision IEEE 32-bit and extended-precision 40-bit floating-point data formats, and 8-, 16-, 32- and 64-bit fixed-point data formats
Integrated I/O includes a 14-channel DMA controller, external port, programmable flag pins, two timers and a timer-expired pin for system integration
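
The throughput figures in the feature list can be checked with a short calculation. The helper names are illustrative; the per-cycle counts and clock rates are the ones quoted in the list.

```python
def macs_per_second(macs_per_cycle, clock_hz):
    """Peak MAC rate if every cycle issues the full number of MACs."""
    return macs_per_cycle * clock_hz

def cycle_time_ns(clock_hz):
    """Length of one instruction cycle in nanoseconds."""
    return 1e9 / clock_hz

rate_16bit = macs_per_second(8, 250e6)   # eight 16-bit MACs per cycle
rate_32bit = macs_per_second(2, 250e6)   # two 32-bit MACs per cycle
period = cycle_time_ns(500e6)            # 500 MHz instruction rate
```

Eight 16-bit MACs per cycle at 250 MHz gives the quoted two billion MACs per second, two 32-bit MACs per cycle gives 500 million, and a 500 MHz clock corresponds to the 2.0 ns cycle.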
3.2 Key Benefits
Provides high-performance static superscalar DSP operations, optimized for large, demanding multiprocessor DSP applications
Performs exceptionally well on DSP algorithm and I/O benchmarks
Supports low-overhead DMA transfers between internal memory, external memory, memory-mapped peripherals, host processors and other DSPs
Eases programming through an extremely flexible instruction set and a high-level-language-friendly DSP architecture
Enables scalable multiprocessing systems with low communication overhead
3.3 DESCRIPTION
The TigerSHARC processor is an ultra-high-performance static
superscalar processor optimized for large signal processing tasks and
communications infrastructure. The DSP combines very wide memory
widths with dual computation blocks, supporting floating-point (IEEE
32-bit and extended-precision 40-bit) and fixed-point (8-, 16-, 32-
and 64-bit) processing, to set a new standard of performance for
digital signal processors. The TigerSHARC static superscalar
architecture lets the DSP execute up to four instructions each cycle,
performing 24 fixed-point (16-bit) operations. Four independent
128-bit-wide internal data buses, each connecting to the six 2 Mbit
memory banks, enable quad-word data, instruction, and I/O access and
provide 28 Gbytes per second of internal memory bandwidth.
Like its competitor, the Texas Instruments TMS320C64x, TigerSHARC
uses a very long instruction word (VLIW) load/store architecture.
TigerSHARC executes as many as four instructions per cycle with its
interlocking ten-stage pipeline and dual computation blocks. Each
block contains a multiplier, an ALU, and a 64-bit shifter and can
perform one 32*32-bit or four 16*16-bit multiply-accumulates (MACs)
per cycle.

TigerSHARC is aimed at telecommunications infrastructure
applications, such as cellular telephone base stations. As
illustrated in the figure, the TigerSHARC architecture contains a
program control unit, two computation units, two address generators,
memory, various peripherals and a DMA controller. With its VLIW
architecture, TigerSHARC is capable of executing up to four
instructions in a single cycle, and its SIMD features enable it to
perform arithmetic operations on multiple 32-bit floating-point
values or multiple 32-, 16- or 8-bit fixed-point values.

Each of TigerSHARC's computation units can perform two 32*32=64-bit
fixed-point multiply-accumulates in a single cycle, using two
operands each made up of two concatenated 32-bit registers. Thus,
using both computation units, TigerSHARC can perform four
32*32=64-bit fixed-point multiply-accumulate operations in a single
cycle. Alternatively, TigerSHARC can perform two 32-bit
floating-point MAC operations per cycle.
In fixed-point DSP applications, the most common word width is 16
bits. With four 16-bit fixed-point elements concatenated in two
32-bit registers, one computation unit can in a single cycle perform
four 16*16=32-bit multiply-accumulate operations (with 8 guard bits
each to avoid overflow), twice as many as any currently available
fixed- or floating-point DSP can perform.
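
The packed 16-bit multiply-accumulate can be modelled in software as follows. This is a sketch under stated assumptions: the lane ordering (lane 0 in the low bits) and the saturating behaviour of the 40-bit accumulators are choices made for the example, and the function names are hypothetical rather than TigerSHARC instructions.

```python
def unpack16(word64):
    """Split a 64-bit word into four signed 16-bit lanes (lane 0 = low bits)."""
    lanes = []
    for i in range(4):
        v = (word64 >> (16 * i)) & 0xFFFF
        lanes.append(v - 0x10000 if v & 0x8000 else v)
    return lanes

def pack16(lanes):
    """Pack four signed 16-bit values into one 64-bit word."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

# 32-bit product plus 8 guard bits gives a 40-bit signed accumulator.
ACC_MIN, ACC_MAX = -(1 << 39), (1 << 39) - 1

def simd_mac16(acc, a_word, b_word):
    """Lane-wise acc[i] += a[i] * b[i], saturating each 40-bit accumulator."""
    a, b = unpack16(a_word), unpack16(b_word)
    return [max(ACC_MIN, min(ACC_MAX, s + x * y))
            for s, x, y in zip(acc, a, b)]

a = pack16([1, -2, 3, 32767])
b = pack16([4, 5, -6, 32767])
acc = simd_mac16([0, 0, 0, 0], a, b)
```

The guard bits matter because the largest 16-bit product is 2^30, so a 40-bit accumulator can absorb several hundred worst-case products before it reaches its limit.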
TigerSHARC uses SIMD features at two levels: two separate computation
units that each operate on SIMD operands. The figure illustrates how
the two SIMD computation units divide the registers into different
data sizes.
TigerSHARC is the first of the new wave of VLIW based DSPs to

provide extensive SIMD capabilities. This approach provides greater

parallelism than that of its Texas Instruments competitors.

On-chip memory is divided into three banks: one for software and two
for data. ADI will not disclose the amount of on-chip memory in the
first TigerSHARC devices, but we expect that the vendor will continue
to be generous with on-chip memory; the predecessor SHARC and
Hammerhead devices include 68K to 512K of on-chip memory.

Data is transferred between the computation units and on-chip memory
in blocks of 32, 64, or 128 bits. When moving 64-bit or 128-bit data,
TigerSHARC transfers data from consecutive memory locations to
consecutive data registers, or vice versa. The smallest amount of
data that can be transferred is 32 bits. If TigerSHARC programs use
word sizes of 8 or 16 bits in a DSP algorithm, they cannot access
individual words; any load or store will transfer at least four 8-bit
or two 16-bit words.

The chip includes a data alignment buffer and a short data alignment
buffer that allow 64 or 128 bits of data to be transferred from (but
not to) any memory location aligned on a 16-bit word boundary.
TigerSHARC provides more flexibility than most processors with SIMD
features, which often require that data be aligned at memory
locations divisible by the size of the data transfer.
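
Because the smallest transfer is 32 bits, reaching a single 16-bit element means loading the whole word that contains it and then selecting half of it. The sketch below models this with a list of 32-bit words; the helper names and the choice of placing the lower-numbered element in the low half of the word are assumptions for illustration.

```python
def load32(memory, word_addr):
    """Smallest possible transfer: one whole 32-bit word."""
    return memory[word_addr] & 0xFFFFFFFF

def read_u16(memory, halfword_index):
    """Read one unsigned 16-bit element by loading the word that holds it."""
    word = load32(memory, halfword_index // 2)
    shift = 16 * (halfword_index % 2)
    return (word >> shift) & 0xFFFF

def write_u16(memory, halfword_index, value):
    """Update one 16-bit element with a read-modify-write of the full word."""
    addr, shift = halfword_index // 2, 16 * (halfword_index % 2)
    mask = 0xFFFF << shift
    memory[addr] = (load32(memory, addr) & ~mask) | ((value & 0xFFFF) << shift)

mem = [0x22221111, 0x44443333]
write_u16(mem, 1, 0xABCD)   # rewrites word 0, changing only its upper half
```

In SIMD code this cost is usually avoided by processing both 16-bit elements of each loaded word together.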
3.4 TigerSHARC Processor families

Nr. | Processor Name | Description | Manufacturer
1 | ADSP-TS101-S | ADSP-TS101S TigerSHARC DSP | Analog Devices
2 | ADSP-TS101S | 300 MHz TigerSHARC Processor with 6 Mbit on-chip SRAM | Analog Devices
3 | ADSP-TS101SAB1-000 | 300 MHz TigerSHARC Processor with 6 Mbit on-chip SRAM | Analog Devices
4 | ADSP-TS101SAB1-100 | 300 MHz TigerSHARC Processor with 6 Mbit on-chip SRAM | Analog Devices
5 | ADSP-TS201S | 500/600 MHz TigerSHARC Processor with 24 Mbit on-chip embedded DRAM | Analog Devices
6 | ADSP-TS201SABP-050 | 500/600 MHz TigerSHARC Processor with 24 Mbit on-chip embedded DRAM | Analog Devices
7 | ADSP-TS202S | 500 MHz TigerSHARC Processor with 12 Mbit on-chip embedded DRAM | Analog Devices
8 | ADSP-TS202SABP-050 | 500 MHz TigerSHARC Processor with 12 Mbit on-chip embedded DRAM | Analog Devices
9 | ADSP-TS202SABP-X | TigerSHARC Embedded Processor | Analog Devices
10 | ADSP-TS202SABP-X | TigerSHARC Embedded Processor | Analog Devices
11 | ADSP-TS203S | 500 MHz TigerSHARC Processor with 4 Mbit on-chip embedded DRAM | Analog Devices
12 | ADSP-TS203SABP-050 | 500 MHz TigerSHARC Processor with 4 Mbit on-chip embedded DRAM | Analog Devices
13 | ADSP-TS203SABP-X | TigerSHARC Embedded Processor | Analog Devices
14 | ADSP-TS203SABP-X | TigerSHARC Embedded Processor | Analog Devices

3.5. FUNCTIONAL BLOCK DIAGRAM

3.5.1 Architectural Features
Flexibility without compromise: the TigerSHARC Processor provides
leading-edge system performance while keeping the highest possible
flexibility in software and hardware development.
The TigerSHARC Processor is an ultra-high performance static

superscalar DSP optimized for multi-processing applications

requiring computationally demanding large signal processing tasks.

This document describes the key features of the TigerSHARC

Processor architecture that combine to offer the highest

performance, flexibility, efficiency and scalability available to

equipment manufacturers in the marketplace today.
3.5.2 Adapts to evolving signal processing demands
The TigerSHARC's unique ability to process 1-, 8-, 16- and 32-bit

fixed-point as well as floating-point data types on a single chip

allows original equipment manufacturers to adapt to evolving

telecommunications standards without encountering the limitations

of traditional hardware approaches. Having the highest performance

DSP for communications infrastructure and multiprocessing

applications available, TigerSHARC allows wireless infrastructure

manufacturers to continue evolving their design to meet the needs

of their target system, while deploying a highly optimized and

effective Node B solution that will realize significant overall

cost savings.
3.5.3 Multiprocessor, general-purpose processing
The TigerSHARC Processor's balanced architecture optimizes system,

cost, power, and density. A single TigerSHARC Processor, with its

large on-chip memory, zero overhead DMA engine, large I/O

throughput, and integrated multiprocessing support, has the

necessary integration to be a complete node of a multiprocessing

system.
This enables a multiprocessor network exclusively made up of

TigerSHARCs without any expensive and power consuming external

memories or logic.
3.5.4 Instruction Parallelism and SIMD Operation
As a static superscalar DSP, the TigerSHARC Processor core can

execute simultaneously from one to four 32-bit instructions encoded

in a single instruction line. With a few exceptions, an instruction

line, whether it contains one, two, three or four 32-bit

instructions, executes with a throughput of one cycle in an eight-

deep processor pipeline. The TigerSHARC Processor has a set of

instruction parallelism rules that programmers must follow when

encoding an instruction line. In general, the selection of

instructions the DSP can execute in parallel each cycle depends on

the instruction line resources each requires and on the source and

destination of registers used. The programmer has direct control of

the three core components - the IALU, the Computation Blocks, and

the Program Sequencer.
In most cases the TigerSHARC Processor has a two-cycle execution

pipeline that is fully interlocked, so whenever a computation

result is unavailable for another operation dependent on it, stall

cycles are automatically inserted. Efficient
programming with dependency-free instructions can eliminate most

computational and memory transfer dependencies. All of the

instruction parallelism rules and data dependencies are documented in

the TigerSHARC Processor User's Guide.
The TigerSHARC Processor also has the capability of supporting

single-instruction, multiple-data SIMD operations through the use

of both Computational Blocks in parallel as well as the use of SIMD

specific computations. The programmer has the option of directing

both Computation Blocks to operate on the same data (broadcast

distribution) or different data (merged distribution). In addition,

each Computation Block can execute four 16-bit or eight 8-bit SIMD

computations in parallel.
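
The two distribution modes can be pictured with a small model. The split rule used for the merged case (first half of the words to one block, second half to the other) is an assumption made for this sketch; the actual register assignment is defined in the processor documentation, and the function names are hypothetical.

```python
def broadcast(words):
    """Both computation blocks receive the same operands."""
    return list(words), list(words)

def merged(words):
    """Operands are divided between the two computation blocks.

    Assumption: the lower half goes to block X, the upper half to block Y.
    """
    half = len(words) // 2
    return list(words[:half]), list(words[half:])

def run_simd(op, block_x, block_y):
    """Apply one operation to both blocks' data, as a SIMD instruction would."""
    return [op(v) for v in block_x], [op(v) for v in block_y]

doubled = run_simd(lambda v: v * 2, *merged([1, 2, 3, 4]))
```

Broadcast suits cases where both blocks need the same input, such as applying two different coefficient sets to one sample stream, while merged distribution lets each block work on its own share of the data.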
3.5.5. Independent, Parallel Computation Blocks
As mentioned above, the TigerSHARC Processor has two Computation

Blocks that can operate either independently, in parallel or as a

SIMD engine. The DSP can issue up to two compute instructions per

Computation Block per cycle, instructing the ALU, multiplier or

shifter to perform independent, simultaneous operations. The

Computation Blocks each contain a 32-word register file and four
computational units: an ALU, a multiplier, a 64-bit shifter and a CLU
(ADSP-TS201S only).
The 32-bit word, multi-ported register files are used for

transferring data between the computational units and data buses,

and for storing intermediate results. Instructions can access the

registers in the register file individually (word-aligned) or in

sets of two (dual-aligned) or four (quad-aligned). The ALU performs

a standard set of arithmetic operations in both fixed-point and

floating-point formats, while also performing logic operations. The

multiplier performs both fixed-point and floating-point
multiplication as well as fixed-point multiply-accumulates. The
64-bit shifter performs logical and arithmetic

shifts, bit and bit-stream manipulation, and field deposit and

extraction.
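
The shifter operations named above have simple software equivalents, shown here on 64-bit values. The function names and argument order are illustrative and do not correspond to TigerSHARC mnemonics.

```python
MASK64 = (1 << 64) - 1

def field_extract(word, position, length):
    """Return `length` bits of word starting at bit `position` (LSB = 0)."""
    return (word >> position) & ((1 << length) - 1)

def field_deposit(word, field, position, length):
    """Return word with `length` bits at `position` replaced by field."""
    mask = ((1 << length) - 1) << position
    return ((word & ~mask) | ((field << position) & mask)) & MASK64

def logical_shift_right(word, count):
    """Shift right, filling the vacated high bits with zeros."""
    return (word & MASK64) >> count

def arithmetic_shift_right(word, count):
    """Shift right while replicating the sign bit (bit 63)."""
    signed = word - (1 << 64) if word & (1 << 63) else word
    return (signed >> count) & MASK64
```

In software each of these takes several instructions; a dedicated shifter performs them as single operations, which is what makes bit-stream handling cheap on the DSP.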
3.5.6. CLU (Communications Logic Unit)
The CLU on the ADSP-TS201S is a 128-bit unit that houses enhanced acceleration instructions, specifically targeted at increasing the number of complex multiplies per cycle and improving the decoding efficiency of the TigerSHARC device. The CLU is not available on the ADSP-TS202S and ADSP-TS203S.
3.5.7. Integer ALUs
The TigerSHARC Processor has two integer ALUs (IALUs) that provide powerful address generation capabilities and perform many general-purpose integer operations. Each IALU has a multi-ported 31-word register file. As address generators, the IALUs perform immediate or indirect (pre- and post-modify) addressing. They perform modulus and bit-reverse operations with no constraints placed on memory addresses for data buffer placement. Each IALU can specify a single-, dual- or quad-word access from memory.
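Bit-reverse addressing is mainly useful for FFTs, where inputs or outputs appear in bit-reversed index order. The sketch below shows, in plain Python, the index sequence such an address generator would produce for a power-of-two buffer. The helper names are hypothetical, and the code models only the arithmetic, not the IALU itself.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(n: int) -> list:
    """Return the bit-reversed visiting order for a buffer of n = 2**k entries."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


order = bit_reversed_order(8)
```

For an 8-point buffer the visiting order is 0, 4, 2, 6, 1, 5, 3, 7.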
The TigerSHARC Processor IALUs enable implementation of circular

buffers in hardware. Circular buffers facilitate efficient

programming of delay lines and other data structures required in

digital signal processing, and they are commonly used in digital

filters and Fourier transforms. Each IALU provides registers for

four circular buffers, so applications can set up a total of eight

circular buffers. The IALUs handle address pointer wraparound

automatically, reducing overhead, increasing performance, and

simplifying implementation.
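The pointer wraparound that the IALUs perform in hardware can be modelled in software as follows. This is a minimal sketch: the class and attribute names (`base`, `length`, `index`) are illustrative assumptions and do not correspond to the processor's actual register names.

```python
class CircularBuffer:
    """Software model of a post-modify address pointer with wraparound."""

    def __init__(self, base: int, length: int):
        self.base = base      # first address of the buffer (any location)
        self.length = length  # number of words in the buffer
        self.index = base     # current address pointer

    def post_modify(self, step: int) -> int:
        """Return the current address, then advance it by `step` with wraparound."""
        address = self.index
        offset = (self.index - self.base + step) % self.length
        self.index = self.base + offset
        return address


delay_line = CircularBuffer(base=100, length=4)
addresses = [delay_line.post_modify(1) for _ in range(6)]
```

After the fourth access the pointer wraps back to the base address without any explicit comparison in the calling code, which is the overhead the hardware removes.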
Circular buffers can start and end at any memory location. Because

the IALU's computational pipeline is one cycle deep, in most cases

integer results are available in the next cycle. Hardware (register

dependency check) causes a stall if a result is unavailable in a

given cycle.
3.5.8. TigerSHARC Memory Integration
The large on-chip memory is divided into three separate blocks of equal size. Each block is 128 bits wide, which gives a quad-word structure with four addresses in every row. For data accesses, the processor can address one 32-bit word, two 32-bit words (long) or four 32-bit words (quad) and transfer them to or from a single computational unit, or both, in a single processor cycle. The programmer only has to ensure that the start address is a multiple of two when fetching long words and a multiple of four when fetching quad words. For applications that process a delay line whose start address does not meet these alignment requirements, or that otherwise need unaligned data fetches, a data alignment buffer (DAB) is provided. Once the DAB is filled, quad-word fetches can be made from it. Besides the internal memory, the TigerSHARC can access up to four gigawords of memory.

The memory map is given in Figure

3.5.9. Program Sequencer
The TigerSHARC Processor Program Sequencer manages program structure and program flow by supplying addresses to memory for instruction fetches. Contained within the Program Sequencer, the Instruction Alignment Buffer (IAB) caches up to five fetched instruction lines waiting to execute. The Program Sequencer extracts an instruction line from the IAB and distributes it to the appropriate core component for execution. Other Program Sequencer functions include determining flow according to instructions such as JUMP, CALL, RTI and RTS; decrementing the loop counters; handling hardware interrupts; and using branch prediction and a 128-entry Branch Target Buffer (BTB) to reduce branch delays for efficient execution of conditional and unconditional branch instructions.
3.5.10. Flexible Integrated Memory
The ADSP-TS20xS family has three memory variants. The ADSP-TS201S has 24 Mbits of on-chip embedded DRAM, divided into six blocks of 4 Mbits (128 K words x 32 bits); the ADSP-TS202S has 12 Mbits of on-chip embedded DRAM, divided into six blocks of 2 Mbits (64 K words x 32 bits); the ADSP-TS203S has 4 Mbits of on-chip embedded DRAM, divided into four blocks of 1 Mbit (32 K words x 32 bits). On all variants, each block can store program memory, data memory or both, so programmers can configure the memory to suit their specific needs. The memory blocks connect to the four 128-bit wide internal buses through a crossbar connection, enabling four memory transfers in the same cycle. The internal bus architecture of the ADSP-TS20xS family provides a total memory bandwidth of 32 Gbytes/second, enabling the core and I/O to access twelve 32-bit data words and four 32-bit instructions per cycle.
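The bandwidth figure can be checked with simple arithmetic. The sketch below assumes a 500 MHz internal clock, which is not stated in this passage but is the rate at which four 128-bit buses give the quoted 32 Gbytes/second; the constant and function names are made up for the example.

```python
BUS_COUNT = 4                   # internal buses, from the text
BUS_WIDTH_BITS = 128            # width of each bus, from the text
ASSUMED_CLOCK_HZ = 500_000_000  # assumption: 500 MHz internal clock

# Words and bytes that all four buses can move in one cycle.
words_per_cycle = BUS_COUNT * BUS_WIDTH_BITS // 32
bytes_per_cycle = BUS_COUNT * BUS_WIDTH_BITS // 8

# Total internal memory bandwidth in bytes per second.
bandwidth_bytes_per_s = bytes_per_cycle * ASSUMED_CLOCK_HZ


def block_words(block_mbits: int) -> int:
    """Number of 32-bit words in a memory block of `block_mbits` Mbits."""
    return block_mbits * 1024 * 1024 // 32
```

Sixteen words per cycle matches the twelve data words plus four instructions quoted above, and a 4 Mbit block works out to 128 K words of 32 bits.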
3.5.11. DMA Controller
The TigerSHARC Processor on-chip DMA controller, with fourteen DMA

channels, provides zero-overhead data transfers without processor

intervention. The DMA controller operates independently and

invisibly to the DSP's core, enabling DMA operations to occur while

the core continues to execute program instructions.
The DMA controller performs routine functions such as external port block transfers, link port transfers and AutoDMA transfers, as well as additional features such as flyby transfers, DMA chaining and two-dimensional transfers.
3.5.12. Link Ports
The ADSP-TS201S and ADSP-TS202S have four full-duplex link ports, each providing four-bit receive and four-bit transmit I/O capability using low-voltage differential signaling (LVDS) technology. Operating at double data rate with a 500 MHz clock, each link can support up to 500 Mbytes per second per direction, for a combined maximum throughput of 4 Gbytes per second.
The ADSP-TS203S has two full-duplex link ports, each providing four-bit receive and four-bit transmit I/O capability using low-voltage differential signaling (LVDS) technology. Operating at double data rate with a 250 MHz clock, each link can support up to 250 Mbytes per second per direction, for a combined maximum throughput of 1 Gbyte per second.
Each Link Port has its own triple-buffered quad-word input and

double-buffered quad-word output registers. The DSP's core can

write directly to a Link Port's transmit register and read from a

receive register, or the DMA controller can perform DMA transfers

through eight dedicated Link Port DMA channels.
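The link port throughput figures follow from the bus width, the clock rate and double-data-rate signaling. The short Python sketch below reproduces the calculation for the four-port, 500 MHz case; the function and parameter names are illustrative only.

```python
def link_throughput_mbytes(clock_mhz: float, data_bits: int = 4,
                           edges_per_clock: int = 2) -> float:
    """Per-direction throughput of one link port in Mbytes per second.

    Double data rate means data is transferred on both clock edges,
    hence edges_per_clock = 2.
    """
    return clock_mhz * data_bits * edges_per_clock / 8


def combined_throughput_mbytes(ports: int, clock_mhz: float) -> float:
    """Total throughput of all ports, counting both directions (full duplex)."""
    return ports * 2 * link_throughput_mbytes(clock_mhz)


per_direction = link_throughput_mbytes(500)       # one link, one direction
combined = combined_throughput_mbytes(4, 500)     # four full-duplex links
```

With four data bits on two edges of a 500 MHz clock, each direction carries 4 Gbit/s, or 500 Mbytes/s, and four full-duplex ports together reach 4000 Mbytes/s. The same functions can be called with other port counts and clock rates.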

3.5.13. External Port
The external port on the TigerSHARC Processor is 64 bits wide and runs at up to 125 MHz. Using the external port, up to eight TigerSHARC Processors, a host and global memory can share the bus without any external logic. This is the second way, in addition to link ports, in which the TigerSHARC DSP supports multiprocessor systems. SDRAM and SBSRAM controllers allow a glueless interface to these types of memory. The external port also supports a flyby mode, which allows a host to access a global shared memory.
4. Applications
At a 250 MHz clock rate, the ADSP-TS101S TigerSHARC offers a DSP industry-best 1500 MFLOPS peak performance and has native support for 8-, 16-, 32- and 40-bit data types. With 1.5 W typical power dissipation, 6 Mbits of on-chip memory, a 14-channel zero-overhead DMA engine, an integrated SDRAM controller, a parallel host interface, cluster multiprocessing support and link port multiprocessing support, the TigerSHARC is well suited to heat-sensitive multiprocessing applications.
Here are some of the target applications for floating-point DSPs:
"TigerSHARC's exceptional speed and functionality are suited for applications in:
Defense - sonar, radar, digital maps, munitions guidance
Medical - ultrasound, CT scanners, MRI, digital X-ray
Industrial systems - data acquisition, control, test, and inspection systems
Video processing - editing, printers, copiers
Wireless Infrastructure - GSM, EDGE, and 3G cellular base stations."
5. Advantages of Tiger SHARC Processor
The Analog Devices TigerSHARC Processor architecture combines performance and flexibility, enabling a cost-effective solution for baseband processing and other applications in the wireless infrastructure market. Wireless infrastructure manufacturers can consider many approaches when developing baseband modem solutions for third-generation (3G) wireless communications networks. The TigerSHARC Processor architecture provides the balance of attributes required to address the range of challenges facing their 3G deployments.
The TigerSHARC Processor is the heart of a software-defined solution for baseband modems, in which all of the implementation occurs in software rather than in hardware, the approach taken by ASICs and other competing DSP solutions. The TigerSHARC Processor allows an infrastructure vendor to establish a single baseband processing platform for all of the 3G standards, with functionality updated through software changes, which shortens time to market.
The architecture of the TigerSHARC, combining elements of RISC and DSP cores, is well suited to deliver the performance required for applications in 3G mobile communications, xDSL technologies and imaging systems. The static superscalar architecture maintains determinism for security-sensitive applications, and the large number of internal registers allows efficient use of a high-level language, which speeds up development.
6. Conclusion
As a result of its "Load Balancing" capabilities, high internal and external bandwidth, large integrated memory and high degree of flexibility, the TigerSHARC Processor proves to be an unconventional but extremely effective solution for baseband signal processing. In future generations of the TigerSHARC Processor we intend to continue the trend towards reduced system cost and component count while increasing the functionality of the solution through clock speed enhancements and an expanded instruction set.

ashish rawat · 08-16-2017, 10:20 PM


SHARC
Super Harvard ARChitecture

Presented By:
Nagendra Doddapaneni

Overview
Harvard Architecture
Super Harvard Architecture
TigerSHARC processor

Outline
Background
Harvard Architecture
- Why
- What
Modern CPU Chip Design
Super Harvard Architecture
TigerSHARC Processor

Background
von Neumann Architecture
- Single storage for instructions and data
Digital Signal Processors
- Specialized microprocessors designed specifically for digital signal processing, generally in real time

Why Harvard Architecture
von Neumann bottleneck (memory bound)
DSP applications
In von Neumann architecture, the CPU can be
- Either reading an instruction
- Or reading/writing data from/to memory
but not both at the same time

What is Harvard Architecture
Physically separate storage and signal pathways for instruction and data
Next instruction fetched, when executing current instruction
Program memory can be small and wide
Data memory can be large and narrower

Modern CPU chip design
Incorporate features from both architectures
On chip cache memory divided into instruction cache and data cache.
Harvard architecture used when CPU accesses cache memory.
On a cache miss, off chip main memory is accessed using von Neumann architecture.
Main memory is not separated into data and instruction sections.

Super Harvard Architecture
Cache used to store instructions, leaving both instruction bus and data bus free to fetch operands
Harvard Architecture + cache = Extended Harvard Architecture or Super Harvard Architecture

TigerSHARC Processor

TigerSHARC
Instruction Parallelism and SIMD Operation
The core can simultaneously execute one to four 32-bit instructions encoded in a single instruction line (VLIW).
Whether instructions can execute in parallel depends on:
- The instruction line resources each instruction requires
- The source and destination registers used
Supports SIMD operations through the use of both Computational Blocks in parallel.
Each Computational Block can execute four 16-bit or eight 8-bit SIMD computations in parallel.

TigerSHARC
Integer ALU
31 32-bit general registers + 1 status register + 8 dedicated registers for circular buffers
Performs integer ALU operations and data addressing
ALU instructions: ADD, SUB, ARS, LRS (right shifts only), ROT (left and right), AND NOT, NOT, OR, XOR, ABS, MIN, MAX, CMP
Status flags: zero (Z), negative (N), overflow (V), carry (C)
Instruction conditions: EQ, LT, LE, NEQ, NLT, NLE
Instruction options: unsigned (U), circular buffer (CB), bit reverse (BR), computed jump (CJMP)
Address related operations: data address generation, circular buffers, bit reverse, UREG moves, DAB control.

TigerSHARC Computational Blocks
X and Y Register File
Register File Syntax
- Each Block has 32x32 bit Data registers
- Each register can store 4x8 bit, 2x16 bit or 1x32 bit words.
- Registers can be combined into dual or quad groups. These groups can store 8, 16, 32, 40 or 64 bit words.
TigerSHARC Computational Blocks
X and Y Register File
Register File Syntax
24 volatile data registers in each block
- XR0-XR23
- YR0-YR23
2 ALU summation registers in each block
- XPR0, XPR1, YPR0, YPR1
5 MAC accumulate registers in each block
- XMR0-XMR3, YMR0-YMR3
- XMR4, YMR4: overflow registers

TigerSHARC
X and Y ALU
2x64 bit input paths
2x64 bit output paths
8, 16, 32, or 64 bit addition/subtraction - Fixed-point
32 or 64 bit logical operations - fixed-point
32 or 40 bit floating-point operations
Sample ALU Instruction
Example of 16 bit addition
XYSR1:0 = R31:30 + R25:24
Performs addition in X and Y Compute Blocks

TigerSHARC
Multiplier
Operates on fixed, floating and complex numbers.
Fixed-Point numbers
- 32x32 bit with 32 or 64 bit results
- 4 (16x16 bit) with 4x16 or 4x32 bit results
Floating-Point numbers
- 32x32 bit with 32 bit result
- 40x40 bit with 40 bit result
Complex Numbers
- 32x32 bit with 32 bit result
- Fixed-point only
Results stored in MR register

TigerSHARC
Shifter
Operates on one 64-bit, one or two 32-bit, two or four 16-bit, and four or eight 8-bit fixed-point operands
Shifts and rotates bits
Bit manipulation operations, like bit set, clear, toggle and test
Bit FIFO operations to support bit streams
TigerSHARC Processor
Processor Architecture
Integer ALU
Computational blocks
- X and Y Register File
- X and Y ALU
- Multiplier
- Shifter
- CLU
Program Sequencer
J and K data buses
I bus data bus
TigerSHARC CLU
CLU instructions are designed to support different algorithms used for communications applications
Algorithms supported are
- Viterbi Decoding (minimal distance decoding algorithm)
- Turbo-code Decoding (variant of Viterbi decoding)
- De-spreading for Code Division Multiple Access (CDMA) systems (used for recovering a signal that has been spread over a wide bandwidth by a pseudo-noise code)

TigerSHARC
Program Sequencer
Supplies instruction addresses to memory
IAB caches up to five fetched instruction lines waiting to execute
It extracts an instruction line from IAB and distributes to appropriate core component for execution
Determine flow control for instructions like JMP, CALL
Reduce branch delays using branch prediction and BTB

TigerSHARC
architecture at a glance
TigerSHARC Buses
DRAM divided into 6 blocks of 4Mbits
6 blocks connect to four 128-bit wide internal buses through a crossbar connection
Internal bus architecture provides a total memory bandwidth of 32Gbytes/sec
Core and I/O can access
- twelve 32-bit data words
- four 32-bit instructions
per cycle

TigerSHARC
DMA Controller
On-chip, with 14 DMA channels
Provide zero-overhead data transfers
Operates independently and invisibly to the DSP's core


Summary
What is Harvard Architecture
What is Super Harvard Architecture
TigerSHARC processor architecture
How TigerSHARC is faster for targeted DSP applications
Questions
Thank You.

electrazy · 08-16-2017, 10:20 PM


ABSTRACT

The TigerSHARC processor is the newest and most powerful member of this family, incorporating mechanisms such as SIMD, VLIW and short vector memory access in a single processor. This is the first time that all of these have been combined in a real-time processor.
The TigerSHARC DSP is an ultra-high-performance static superscalar architecture that is optimized for telecommunications infrastructure and other computationally demanding applications.
The unique architecture combines elements of RISC, VLIW and standard DSP processors to provide native support for 8-, 16- and 32-bit fixed-point as well as floating-point data types on a single chip. Large on-chip memory, extremely high internal and external bandwidths and dual compute blocks provide the necessary capabilities to handle a vast array of computationally demanding, large signal processing tasks.

INTRODUCTION
1.1 Analog and digital signals
In many cases, the signal of interest is initially in the form of an analog electrical voltage or current, produced for example by a microphone or some other type of transducer. An analog signal must be converted into digital form before DSP techniques can be applied. An analog electrical voltage signal, for example, can be digitized using an electronic circuit called an analog-to-digital converter or ADC. This generates a digital output as a stream of binary numbers whose values represent the electrical voltage input to the device at each sampling instant.
1.2 Signal processing
Signals commonly need to be processed in a variety of ways. For example, the output signal from a transducer may well be contaminated with unwanted electrical "noise". The electrodes attached to a patient's chest when an ECG is taken measure tiny electrical voltage changes due to the activity of the heart and other muscles. The signal is often strongly affected by "mains pickup" due to electrical interference from the mains supply. Processing the signal using a filter circuit can remove or at least reduce the unwanted part of the signal. Increasingly nowadays, the filtering of signals to improve signal quality or to extract important information is done by DSP techniques rather than by analog electronics.
1.3 Digital Signal Processing

Digital signal processing (DSP) is the study of signals in a digital representation and the processing methods of these signals. DSP and analog signal processing are subfields of signal processing. Digital Signal Processing is carried out by mathematical operations. In comparison, word processing and similar programs merely rearrange stored data. This means that computers designed for business and other general applications are not optimized for algorithms such as digital filtering and Fourier analysis. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks. These devices have seen tremendous growth in the last decade, finding use in everything from cellular telephones to advanced scientific instruments. In fact, hardware engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use "DSP" to mean Digital Signal Processing.
1.4 Development of DSP
The development of digital signal processing dates from the 1960's with the use of mainframe digital computers for number-crunching applications such as the Fast Fourier Transform (FFT), which allows the frequency spectrum of a signal to be computed rapidly. These techniques were not widely used at that time, because suitable computing equipment was generally available only in universities and other scientific research institutions.
1.5 Digital Signal Processors (DSPs)
DSP processors are microprocessors designed to perform digital signal processing: the mathematical manipulation of digitally represented signals. The introduction of the microprocessor in the late 1970's and early 1980's made it possible for DSP techniques to be used in a much wider range of applications. However, general-purpose microprocessors such as the Intel x86 family are not ideally suited to the numerically-intensive requirements of DSP, and during the 1980's the increasing importance of DSP led several major electronics manufacturers (such as Texas Instruments, Analog Devices and Motorola) to develop Digital Signal Processor chips - specialised microprocessors with architectures designed specifically for the types of operations required in digital signal processing. (Note that the acronym DSP can variously mean Digital Signal Processing, the term used for a wide range of techniques for processing signals digitally, or Digital Signal Processor, a specialised type of microprocessor chip.) Like a general-purpose microprocessor, a DSP is a programmable device, with its own native instruction code. DSP chips are capable of carrying out millions of floating point operations per second, and like their better-known general-purpose cousins, faster and more powerful versions are continually being introduced. DSPs can also be embedded within complex "system-on-chip" devices, often containing both analog and digital circuitry.

Advantage over other Microprocessors
Single-cycle multiply-accumulate (MAC) operations
Real time performance
Flexibility and Reliability
Increased system performance
Reduced cost
Harvard architecture

madhu.j · 08-16-2017, 10:20 PM

SHARC Processor

DSPs are a special class of microprocessors that are optimized for computing the real-time calculations used in signal processing.
DSPs have an architecture that simplifies application designs and makes low-cost signal processing a reality.