Efficient Approximate DCT Transform For Image Compression

Parvin Tamboli 1, Aparna Shinde 2

1 M.E.(VLSI & ES), D.Y.Patil College of Engineering, Pune, Maharashtra, India
2 Department of E & TC, D.Y.Patil College of Engineering, Pune, Maharashtra, India

Abstract: The need for an efficient technique for compression of images ever increasing because the original images need large amounts of disk space seems to be a big disadvantage during transmission & storage. The discrete cosine transform (DCT) is widely used in image compression standards due to its energy compaction properties. In this paper, a multiplierless 8-point approximate DCT requiring only 14 additions is introduced. The proposed algorithm was assessed in image compression. The transform offer superior compression performance at very low circuit complexity. Such approximations can be realized in digital VLSI hardware using additions and subtractions only.

Keywords: Approximate DCT, Image Compression, Low complexity Algorithm

1. Introduction

Image compression is essential for many applications such as multimedia, internet and mobile communication. Transform-based compression techniques are widely used in such applications. The discrete cosine transform (DCT) is an essential mathematical tool in both image and video coding. DCT is widely adopted in several image and video coding standards such as JPEG, MPEG-1, MPEG-2, H.261, and H.263. This is mainly due to its good energy compaction properties.

The several efficient algorithms were developed for fast computation of DCT. As fast algorithms can significantly reduce the computational complexity of computing the DCT, floating-point operations are still required. The floating-point operations are expensive in terms of circuitry complexity and power consumption. Therefore, minimizing the number of floating-point operations is required in a fast algorithm. The solution to this problem is use of approximate transforms. In this context, several work has been done. The prominent techniques include the Bouguezel-Ahmad-Swamy (BAS) series of algorithms [2],[3] and the Cintra-Bayer (CB) approximate DCT [4].

In this paper, we presented a 8x8 orthogonal transform matrix. The proposed matrix is sparse and satisfies orthogonality property. A fast algorithm for computation of the matrix is also presented. The introduced DCT approximation possesses an extremely low arithmetic complexity, requiring only 14 additions.

2. Proposed Work

The transformation matrix for 8 point DCT approximation has following format, [diagonal matrix] × [low complexity matrix]

The diagonal matrix usually contains irrational numbers in the form 1/√m, where m is a small positive integer. The irrational numbers required in the diagonal matrix would require an increased computational complexity. However, in the context of image compression, the diagonal matrix can simply be absorbed into the quantization step of JPEG-like compression procedures. Therefore, in this case, the complexity of the approximation is bounded by the complexity of the low-complexity matrix. Since the entries of the low complexity matrix comprise only powers of two in {0,±1/2,±1,±2} so that null multiplicative complexity is achieved.

The aim is to derive low-complexity approximate DCT. For this, a candidate matrix is find that possess low computation cost. The cost of a transformation matrix is defined as the number of arithmetic operations required for its computation. The good candidates matrix has entries such that it do not require multiplication operations. The following DCT approximation is proposed.

\[ C = D \cdot T \]

Where \( T \) is transformation matrix and it is given as

\[
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & -1 & 1 & 0 & 0 & 1 \\
0 & -1 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
\end{bmatrix}
\]

And the diagonal matrix \( D \) is given as

\[
D = \text{diag} \left( 1/\sqrt{8}, 1/\sqrt{2}, 1/2, 1/\sqrt{2}, 1/\sqrt{8}, 1/\sqrt{2}, 1/2 \right)
\]

The transformation matrix \( T \) has entries in \{0, ±1\} and it can be given a sparse factorization according to

\[ T = P_4 \cdot A_{12} \cdot A_{11} \cdot A_1 \]

Where

\[ A_{11} = \text{diag} \left( 1, 0, 0, 1 \right) \]

\[ P_4 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & -1 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix}, I_4 \]


transformation the algorithm gives back the transformed 8 pixels for performing the transform. Adder A1 = diag ([1 1 -1], I8 )

\[
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1 \\
\end{bmatrix}
\]

And P3 is the permutation (1) (2 5 6 8 4 3 7)

3. Digital Architecture and Realizations

The proposed architecture is as shown in fig.1. This digital architectures were designed using Xilinx software. The block diagram consists of three units such as Block RAM, Pixel Access & Control Unit and DCT Unit.

1) Block RAM

Block RAM is built in available with various Xilinx FPGAs. It needs to be configured depending upon the requirements of the design. The Xilinx Block Memory Generator (BMG) core is an advanced memory constructor that generates area and performance-optimized memories using embedded block RAM resources in Xilinx FPGAs. In our design we are exploring the Block Ram to store the raw image and after performing the DCT operation to save the transformed image.

The simple dual port RAM is generated. The raw image is gray image of 256 x 256 pixel data. The image requires 64kbytes of memory space of block RAM. The block RAM width is of 128 bytes. This configuration of the block ram helps to fetch 8 pixels at a time improving the latency and the performance of the DCT operation. Also the proposed algorithm needs 8 pixels for performing the transform. Overall the throughput is increased as well as the design becomes simplistic for implementation. All 256X256 image pixels are sequentially stored in the block RAM. Transformed image also has the similar configuration as described above. For every 8 pixels given for the transformation the algorithm gives back the transformed 8 pixels.

2) Pixel Access and Control Unit

Pixel access and control unit is responsible for fetching the image pixel data from the block RAM. Since the DCT transformation algorithm requires 8 pixels for performing the transformation operation, this unit accesses the block RAM to get the corresponding pixel data. As stated in the block RAM section above the raw image is stored in the sequential manner with 8 pixels at one address location (block RAM width being 128 bytes). Hence for performing transform, fetch unit has to access one location of the block RAM at a time. Thus it is a single read operation on the block RAM. This unit also keeps tab on the number of pixels being accessed to understand the various image dimensions for portability purpose. Internally it has counters to understand the number of accesses to be made to get single row of 256x256 image matrix and the specific row which is being accessed.

This unit operates in synchronization with the DCT unit to whom it provides the pixel data. Even before making access to the Block RAM memory, the state machines within this unit polls for the readiness of the DCT unit. When DCT unit performs handshake with this unit then the pixel data is accessed from the Block RAM and served to DCT unit. The handshake is performed for every access of pixel data. This unit communicates its availability to DCT unit. It accepts the transformed data and is responsible for writing back this data to block RAM. At every point it is guaranteed that the garbage data will never reach the image store unit. This unit has the default storage base address. It starts writing back the transformed data from this base address onwards. It comprises a simple sequential counter to generate the address pointer. This unit will also communicate the end of DCT transformation.

3) DCT Unit

DCT unit is core block of the entire design. Its responsibilities are:
- Request for raw image pixel data
- Perform the DCT transformation operation on pixel data
- Off load signed data of the transformation to pixel access and control unit.

It functions with tight handshake with all the design units connected to it. This ensures that data integrity is tightly maintained and loss of any pixel data is strictly avoided. The raw pixel data is requested and taken by DCT unit. The main DCT algorithm comprises of 14 adders. Adder implementation is purely combinational in nature. But care is taken to eliminate the dependencies of the resultant of the adders. Hence the overall implementation of the adders have some controlling glue logic which is sequential circuit. We have adopted the pipeline approach where all the DCT addition operation is divided into 3-stage pipeline. In first stage of the pipeline the additions which merely requires raw pixel data is performed. In the second stage of the pipeline the additions of the raw pixel and the resultant of the first pipeline stage is performed. Third stage of the pipeline performs addition of raw pixels and the second stage of the algorithm. The resultant of the third stage of the pipeline is the DCT transformed data. The pipeline stages are controlled using the state machine which is tightly coupled with pixel access and control unit.

4. Simulation Results

The proposed architecture is implemented in VHDL. The simulation result is as shown in fig.2


www.ijsr.net

Licensed Under Creative Commons Attribution CC BY
5. Computational Complexity and Performance Analysis

We consider the arithmetic complexity as a figure of merit for estimating the computational complexity. The arithmetic complexity consists of the number of elementary arithmetic operations (additions/subtractions, multiplications/divisions, and bit shift operations) required to compute a given transformation.

In the context of image compression, the complexity of the diagonal matrix can be absorbed into the quantization step, therefore the diagonal matrix does not contribute towards an increase of the arithmetic complexity [9]. The peak signal to noise ratio (PSNR) is taken as parameter for performance measures. We get average PSNR=27.2087 dB at a compression ratio of 50%. The figure 3 shows the resulting compressed Leena image obtained from the above described procedure for 50% compression ratio.

6. Conclusion

In this paper, we proposed a low-power 8-point DCT approximation that requires only 14 addition operations. The hardware architecture for the proposed transform is implemented in VHDL. The proposed transform offers low arithmetic complexity.

References