lakeer · 10-04-2017, 09:00 PM

Abstract
High-definition video applications, such as digital
TV and digital video cameras, require high processing performance
for high-quality visual images in addition to a complex
video CODEC. Pre-/postprocessing to improve video quality
is becoming much more important because requirements for
pre-/postprocessing vary among applications and processing
algorithms have not been stabilized. Therefore, a new processor
architecture that has a highly parallel datapath is needed. In this
paper, we introduce the vector coprocessor (VCP), a VLIW vector
media coprocessor that includes three asymmetric execution
pipelines with cascaded SIMD ALUs. To improve performance
efficiency, we reduce the area ratio of the control circuit while
increasing the ratio of the arithmetic circuit. The total gate count
of VCP is 1268 kgates and its maximum operating frequency is
300 MHz in a 90-nm CMOS process. We evaluate some of the processing
kernels in an adaptive prefilter that is applied to preprocessing for
video encoding. In the case of the edgeness and the sum of
absolute differences, the performance is 183 giga operations per
second. VCP offers enough performance for HD video processing
and good cost-performance while all processing pipeline units
operate effectively.
Index Terms: Single instruction stream, multiple data stream
(SIMD), vector coprocessor (VCP), very long instruction word
(VLIW).
I. INTRODUCTION
Nowadays, high-definition video applications, such as
digital TV and digital video cameras, require high processing
performance for high-quality visual images in addition
to a complex video CODEC. Pre-/postprocessing to improve
video quality is becoming much more important because requirements
for pre-/postprocessing vary among applications and
processing algorithms have not been stabilized.
We focused on two observations. First, much video
pre-/postprocessing operates on sets of data elements as vectors
that evolve continuously in time. Second, image processing
algorithms frequently execute the same computation on each
element of a vector and execute sequences of operations on
vector elements.
For such sequences of operations, an effective implementation
performs loop operations on the same single instruction stream,
multiple data stream (SIMD) ALUs as many times as the sequence
requires and structures the hardware as a pipeline of cascaded
ALUs. This would achieve a high-performance and energy-efficient
architecture while providing reusable hardware. Reusing the
hardware across many video coding applications would keep
development cost low.
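
The cascaded-ALU idea above can be sketched in software. The example below is only an illustration, not the VCP datapath: it models a sum of absolute differences (SAD) as three chained lane-wise stages (subtract, absolute value, reduce) and loops over the rows of a block. The lane count of 16, the function names, and the three-stage split are assumptions made for this sketch.

```python
from typing import List

LANES = 16  # assumed SIMD width (one 16-pixel row); not taken from the paper


def simd_sub(a: List[int], b: List[int]) -> List[int]:
    """Stage 1: lane-wise subtraction."""
    return [x - y for x, y in zip(a, b)]


def simd_abs(v: List[int]) -> List[int]:
    """Stage 2: lane-wise absolute value."""
    return [abs(x) for x in v]


def simd_reduce_add(v: List[int]) -> int:
    """Stage 3: add all lanes into one scalar."""
    return sum(v)


def sad_row(cur: List[int], ref: List[int]) -> int:
    """One pass through the cascade: sub -> abs -> reduce."""
    assert len(cur) == len(ref) == LANES
    return simd_reduce_add(simd_abs(simd_sub(cur, ref)))


def sad_block(cur_block: List[List[int]], ref_block: List[List[int]]) -> int:
    """Loop the same cascade over every row of a block and accumulate."""
    total = 0
    for cur_row, ref_row in zip(cur_block, ref_block):
        total += sad_row(cur_row, ref_row)
    return total
```

In hardware, the three stages would sit back to back in one pipeline so that the output of one ALU feeds the next without a round trip through the register file. Here each stage is simply a function call on a list.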
In this paper, we introduce the vector coprocessor (VCP), a very
long instruction word (VLIW) coprocessor
that has been customized to the computation requirements of
image processing. The coprocessor architecture includes three
asymmetric execution pipelines with cascaded SIMD ALUs to
exploit loop-level parallelism. The VCP architecture combines
cascaded SIMD ALUs with asymmetric parallel pipelines. Compared
with conventional processors with SIMD instructions, it trades
generality for good cost-performance by specializing its datapaths
for lower-level image processing, such as preprocessing and
postprocessing. VCP is designed as a coprocessor for the image
processing of video CODECs, and the width of its SIMD ALUs
is limited to that of CODEC macroblocks. Therefore, we
introduce a cascaded structure of SIMD ALUs to exploit high
parallelism. To achieve high performance with small hardware
size, we reduce the area ratio of the control circuit while
increasing the ratio of the arithmetic circuit. For instance, the
coprocessor architecture relies on static optimization by the
compiler and therefore has no forwarding hardware. Omitting it
increases the ratio of the arithmetic circuit.
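
To make the trade-off concrete, here is a minimal sketch of what static scheduling without forwarding hardware can look like. It is not the VCP compiler: the instruction format, the register names, the single-issue model, and the fixed result latency of 3 cycles are all hypothetical. The scheduler keeps program order and inserts NOPs until every source register has been written back.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())


def schedule(program: List[Instr], latency: int = 3) -> List[Instr]:
    """Pad an in-order program with NOPs so no instruction reads a
    register before its producer has written it back.

    latency is an assumed number of cycles from issue to write-back;
    with no forwarding path, a consumer must wait that long.
    """
    ready_at: Dict[str, int] = {}  # register -> first cycle it can be read
    out: List[Instr] = []          # list index doubles as the issue cycle
    for ins in program:
        earliest = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        while len(out) < earliest:
            out.append(NOP)
        issue_cycle = len(out)
        out.append(ins)
        if ins.dest is not None:
            ready_at[ins.dest] = issue_cycle + latency
    return out
```

A real compiler would try to move independent instructions into the gaps instead of emitting NOPs, and a VLIW target would fill several issue slots per cycle. The sketch only shows that hazard handling moves from hardware into the compiler.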
The remainder of this paper is organized as follows. Section II
reviews previous related work. We introduce the architecture of
VCP in Section III. Section IV shows examples of image processing
kernels in an adaptive prefilter used in preprocessing
for high-definition video encoding. Finally, in Section V, we
present the hardware implementation and the performance evaluation
results and discuss the effectiveness of the architecture
in real-time adaptive prefilter processing and other image processing
kernels.

Download full report
http://googleurl?sa=t&source=web&cd=3&ve...799236.pdf%3Farnumber%3D4799236&ei=X2UITuSEPITUiAKn_aGlDQ&usg=AFQjCNG-e8H4CAfRZMttY1Gy4yvg9r5R_A&sig2=6wATA65_grgGOxfdybfzcA