
An Efficient VLSI Implementation for the 1D Convolutional Discrete Wavelet Transform
Abstract
This paper presents an efficient implementation of a convolution-based 1D discrete wavelet transform (DWT). The proposed architecture combines several optimizations that improve the performance of the hardware design in terms of throughput and power dissipation. We designed and analyzed the performance of numerous DWT architectures using pertinent metrics and cost functions that assess the impact of the design optimizations. We synthesized our VLSI architectures using a 0.18 µm standard cell library. The final VLSI design combines polyphase decimated FIR filters to reduce power dissipation, pipelined computational cells for higher throughput, and data interleaving for lower chip area. An analytical comparison with other existing DWT implementations shows a twofold improvement in throughput for the proposed architecture.
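The abstract credits polyphase decimated FIR filters with the power savings. The NumPy sketch below is an illustration of the underlying identity, not the paper's hardware: filtering at the full input rate and then keeping every second output gives the same samples as splitting the taps h and the input x into even and odd phases and convolving each phase pair at half rate. The function names `filter_then_decimate` and `polyphase_decimate` are hypothetical, and the sketch assumes a filter with at least two taps.

```python
import numpy as np

def filter_then_decimate(x, h):
    """Direct form: convolve at the full input rate, then keep y[0], y[2], y[4], ..."""
    y = np.convolve(np.asarray(x, float), np.asarray(h, float))
    return y[::2]

def polyphase_decimate(x, h):
    """Polyphase form of the same operation: y[2n] = (h0 * x0)[n] + (h1 * x1)[n],
    where h0/h1 are the even/odd taps, x0[m] = x[2m] and x1[m] = x[2m - 1] (x[-1] = 0)."""
    x, h = np.asarray(x, float), np.asarray(h, float)
    if len(h) < 2:
        raise ValueError("this sketch assumes a filter with at least two taps")
    h0, h1 = h[0::2], h[1::2]                  # even- and odd-indexed taps
    x0 = x[0::2]                               # even-indexed samples
    x1 = np.concatenate(([0.0], x[1::2]))      # odd-indexed samples, delayed by one slot
    n_out = (len(x) + len(h)) // 2             # same length as filter_then_decimate
    y = np.zeros(n_out)
    b0 = np.convolve(x0, h0)                   # half-rate branch 0
    b1 = np.convolve(x1, h1)                   # half-rate branch 1
    y[:len(b0)] += b0
    y[:len(b1)] += b1
    return y
```

In the direct form, every odd-indexed convolution output is computed and then thrown away by the decimator. In the polyphase form, each branch works on half-rate data with about half the taps, so roughly half as many multiplications are performed per input sample, which is the usual argument for lower power dissipation.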
I. INTRODUCTION
The large volumes of data transmitted over a channel demand improved methods for compressing and encoding a signal to ensure better bandwidth utilization. The DWT is a relatively new and computationally efficient method for decomposing a signal into several frequency bands and compressing the signals using various coding schemes. The uneven distribution of the signal energy across the subbands allows the original signal to be compressed using wavelets and later reconstructed with minimal loss of information. Several VLSI implementations of the DWT algorithm have emerged over the past decade in an attempt to improve hardware performance [1]-[5]. Most hardware implementations address one or two essential design optimizations to improve their performance in terms of area, throughput or power dissipation. High throughput and low power are considered two of the essential optimization axes, especially for portable and real-time DSP applications. One simple DWT design modeled at the register transfer level (RTL) was presented by Baganne [1], where a three-level DWT was implemented using a binary tree structure. This particular structure forms the basis for expanding the design space to alternate DWT architectures that improve area, throughput and power dissipation. Fig. 1 illustrates a signal flow graph (SFG) for the hardware implementation of a multi-level DWT system. The inefficiencies in this design include a large critical path delay in the filter structures, a large design area due to redundant FIR filters, and inefficient power dissipation caused by cascading each FIR filter with a decimator (half of the filter outputs are computed only to be discarded). The advantage of such a design, however, is the minimal clock latency with which the filter structures process the signal and the ease of hardware implementation.
Fig. 1. Three-level, eight-channel subband decomposition.
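Fig. 1 itself is not reproduced here. As a minimal sketch, assuming the eight-channel decomposition splits both the lowpass and the highpass branch at every level, and using Haar filters purely as an example filter bank (this section does not give the paper's filters), the Python code below builds the binary tree from filter-then-decimate stages and counts the filter instances a direct hardware mapping would need. The names `analysis_stage` and `tree_decompose` are hypothetical.

```python
import numpy as np

# Haar analysis filters: an assumed example, not the paper's filter bank.
H = np.array([1.0, 1.0]) / np.sqrt(2.0)   # lowpass
G = np.array([1.0, -1.0]) / np.sqrt(2.0)  # highpass

def analysis_stage(x, h, g):
    """One filter-then-decimate stage: full-rate convolution, keep odd-indexed outputs.
    For Haar this pairs samples (x[2m], x[2m+1]) and preserves signal energy."""
    lo = np.convolve(x, h)[1::2]
    hi = np.convolve(x, g)[1::2]
    return lo, hi

def tree_decompose(x, levels):
    """Full binary-tree decomposition: split every branch at every level.
    Returns the 2**levels leaf subbands and the number of filter instances used."""
    bands = [np.asarray(x, dtype=float)]
    filters_used = 0
    for _ in range(levels):
        next_bands = []
        for b in bands:
            lo, hi = analysis_stage(b, H, G)
            next_bands.extend([lo, hi])
            filters_used += 2  # one lowpass and one highpass filter per tree node
        bands = next_bands
    return bands, filters_used
```

For a 16-sample input and three levels, the tree yields eight subbands of two samples each and instantiates 2 + 4 + 8 = 14 filters, which illustrates the area cost of replicating FIR filters at every node when the tree is mapped directly to hardware.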
Most conventional DWT implementations employ a single processing element that performs the computations by interleaving the data from successive levels of decomposition [2]. The interleaved implementation reduces the overall design area, but may exhibit a large critical path delay depending on the filter structure and length. Reducing this delay through pipelining may increase the latency, degrading throughput performance for real-time applications. A similar method to reduce area and improve throughput was proposed in [3], wherein the highpass and lowpass filter coefficients, rather than the data, were interleaved to reduce the number of multipliers. Other designs have used a folded DWT structure that stores the intermediate outputs from each level of decomposition in a memory block before they are processed again by the same DWT block [4]. Folded structures tend to exhibit a large latency, since the successive levels of decomposition are interleaved with the preceding levels, and tend to require complex control logic. The advantages of such designs are a shorter critical path in the multiply-accumulate datapath and a reduction in chip area.
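Data interleaving on a single processing element works because each level of decomposition produces outputs at half the rate of the level above it: with one input sample per cycle, level j needs one output every 2**j cycles, so all levels together occupy 1/2 + 1/4 + ... < 1 of the available cycles. The Python sketch below uses one conflict-free assignment rule in the style of the recursive pyramid algorithm: cycle t serves level j when t % 2**j == 2**(j - 1) - 1. This rule is an assumption for illustration; [2] may use a different schedule, and the helper names are hypothetical.

```python
def level_for_cycle(t, levels):
    """Return the level (1..levels) the shared processing element serves in cycle t,
    or None if the cycle is idle. Level j is chosen when t has exactly j - 1
    trailing one bits, which is equivalent to t % 2**j == 2**(j - 1) - 1."""
    trailing_ones = 0
    while t & 1:
        trailing_ones += 1
        t >>= 1
    level = trailing_ones + 1
    return level if level <= levels else None

def interleaved_schedule(levels, cycles):
    """Level served in each of the first `cycles` clock cycles."""
    return [level_for_cycle(t, levels) for t in range(cycles)]

def utilization(levels, cycles):
    """Fraction of cycles in which the shared processing element is busy."""
    sched = interleaved_schedule(levels, cycles)
    return sum(s is not None for s in sched) / cycles
```

For three levels over 16 cycles, `interleaved_schedule(3, 16)` returns `[1, 2, 1, 3, 1, 2, 1, None, 1, 2, 1, 3, 1, 2, 1, None]`: level 1 takes every other cycle, level 2 every fourth, level 3 every eighth, and the processing element idles in 2 of 16 cycles (`utilization(3, 16)` is 0.875, i.e. 1 - 2**-3). Because each cycle has a unique trailing-ones count, no two levels ever compete for the same cycle; the control logic mentioned above is what would encode such a schedule in hardware.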
