
Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures

Published: 07 February 2014

Abstract

With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed every day. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Although these platforms are heterogeneous multicores, no existing approach combines a system's CPU and GPU for JPEG decoding.
In this paper we introduce a novel JPEG decoding scheme for heterogeneous architectures consisting of a CPU and an OpenCL-programmable GPU. We employ an offline profiling step to determine the performance of a system's CPU and GPU with respect to JPEG decoding. For a given JPEG image, our performance model uses (1) the CPU and GPU performance characteristics, (2) the image entropy and (3) the width and height of the image to balance the JPEG decoding workload on the underlying hardware. Our runtime partitioning and scheduling scheme exploits task, data and pipeline parallelism: the non-parallelizable entropy decoding task is scheduled on the CPU, whereas inverse discrete cosine transforms (IDCTs), color conversions and upsampling are performed on both the CPU and the GPU. Our kernels have been optimized for GPU memory hierarchies.
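As a rough illustration of the partitioning idea, the sketch below splits image rows between the CPU and the GPU in proportion to their profiled throughput, so that both devices finish at about the same time. It is a simplified assumption and not the performance model of the paper: it ignores image entropy and pipeline overlap, and the names `split_rows`, `cpu_rate`, `gpu_rate` and `mcu_rows` are hypothetical.

```python
def split_rows(height, cpu_rate, gpu_rate, mcu_rows=8):
    """Split `height` image rows into (gpu_rows, cpu_rows).

    cpu_rate and gpu_rate are profiled throughputs in rows per millisecond
    (hypothetical units). The GPU share is rounded to a multiple of
    `mcu_rows` so that a block row is never cut in half.
    """
    if height <= 0 or cpu_rate <= 0 or gpu_rate < 0:
        raise ValueError("height and cpu_rate must be positive, gpu_rate non-negative")

    # Equal finishing time: gpu_rows / gpu_rate == cpu_rows / cpu_rate.
    ideal_gpu_rows = height * gpu_rate / (cpu_rate + gpu_rate)

    # Snap to block-row granularity and clamp to the image height.
    gpu_rows = int(round(ideal_gpu_rows / mcu_rows)) * mcu_rows
    gpu_rows = max(0, min(gpu_rows, height))
    return gpu_rows, height - gpu_rows


if __name__ == "__main__":
    # A GPU profiled as three times faster than the CPU gets three quarters of the rows.
    print(split_rows(1024, cpu_rate=1.0, gpu_rate=3.0))
```

With a GPU rate of zero the whole image stays on the CPU, which is the fallback one would expect on a system without a usable GPU.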
We have implemented the proposed method in the context of the libjpeg-turbo library, which is an industrial-strength JPEG encoding and decoding engine. Libjpeg-turbo's hand-optimized SIMD routines for ARM and x86 architectures constitute a competitive yardstick against which to compare the proposed approach. Retrofitting libjpeg-turbo with our method provides insights into the software-engineering aspects of re-engineering legacy code for heterogeneous multicores.
We have evaluated our approach on a total of 7194 JPEG images across three high-end and mid-range CPU-GPU combinations. We achieve speedups of up to 4.2x over the SIMD version of libjpeg-turbo, and speedups of up to 8.5x over its sequential code. Taking into account the non-parallelizable JPEG entropy decoding part, our approach achieves up to 95% of the theoretically attainable maximal speedup, with an average of 88%.
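One way to read the "theoretically attainable maximal speedup" is as an Amdahl-style bound. If entropy decoding takes a fraction f of the baseline decoding time and cannot be parallelized, then the best case is that all other work is fully overlapped with it, giving a bound of 1/f. The sketch below computes that bound and the fraction of it that a measured speedup reaches. This reading, the function names and the example numbers are assumptions made for illustration; they are not taken from the paper.

```python
def max_speedup(entropy_fraction):
    """Upper bound on speedup when a fraction of the work stays sequential."""
    if not 0.0 < entropy_fraction <= 1.0:
        raise ValueError("entropy_fraction must be in (0, 1]")
    return 1.0 / entropy_fraction


def fraction_of_bound(achieved_speedup, entropy_fraction):
    """Share of the theoretical bound that a measured speedup attains."""
    return achieved_speedup / max_speedup(entropy_fraction)


if __name__ == "__main__":
    # Hypothetical numbers: entropy decoding is 25% of the baseline time,
    # so the bound is 4x, and a measured 3.8x reaches most of it.
    print(max_speedup(0.25))
    print(fraction_of_bound(3.8, 0.25))
```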


Published In

PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores
February 2014
156 pages
ISBN:9781450326575
DOI:10.1145/2578948
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU programming
  2. JPEG decoding
  3. OpenCL
  4. dynamic work partitioning
  5. heterogeneous multicores
  6. offline profiling

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

PPoPP '14
Acceptance Rates

Overall Acceptance Rate 53 of 97 submissions, 55%
