skip to main content
research-article

A sound and complete abstraction for reasoning about parallel prefix sums

Published: 08 January 2014 Publication History

Abstract

Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance.
We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length $n$ if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the input and result require O(n lg(n)) space. This improves upon an existing result by Sheeran where the input requires O(n lg(n)) space and the result O(n2 \lg(n)) space, and is more feasible for large n than a method by Voigtlaender that uses O(n) space for the input and result but requires running O(n2) tests. We then extend our abstraction and results to the context of data-parallel programs, developing an automated verification method for GPU implementations of prefix sums. Our method uses static verification to prove that a generic prefix sum implementation is data race-free, after which functional correctness of the implementation can be determined by running a single test case under the interval of summations abstraction.
We present an experimental evaluation using four different prefix sum algorithms, showing that our method is highly automatic, scales to large thread counts, and significantly outperforms Voigtlaender's method when applied to large arrays.

Supplementary Material

JPG File (d2_left_t9.jpg)
MP4 File (d2_left_t9.mp4)

References

[1]
A. Betts, N. Chong, A. F. Donaldson, S. Qadeer, and P. Thomson. GPUVerify: a verifier for GPU kernels. In phOOPSLA, pages 113--132, 2012.
[2]
M. Billeter, O. Olsson, and U. Assarsson. Efficient stream compaction on wide SIMD many-core architectures. In phHPG, pages 159--166, 2009.
[3]
G. E. Blelloch. Scans as primitive parallel operations. phIEEE Trans. Comput., 38 (11): 1526--1538, 1989.
[4]
G. E. Blelloch. Prefix sums and their applications. In J. H. Reif, editor, phSynthesis of Parallel Algorithms. Morgan Kaufmann, 1990.
[5]
R. P. Brent and H.-T. Kung. A regular layout for parallel adders. phIEEE Trans. Computers, 31 (3): 260--264, 1982.
[6]
C. Cadar, D. Dunbar, and D. R. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In phOSDI, pages 209--224, 2008.
[7]
S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In phWorkload Characterization, pages 44--54, 2009.
[8]
N. Chong, A. F. Donaldson, P. H. Kelly, J. Ketema, and S. Qadeer. Barrier invariants: A shared state abstraction for the analysis of data-dependent GPU kernels. In phOOPSLA, pages 605--622, 2013.
[9]
P. Collingbourne, C. Cadar, and P. H. J. Kelly. Symbolic testing of OpenCL code. In phHVC'11, pages 203--218, 2012.
[10]
P. Collingbourne, A. F. Donaldson, J. Ketema, and S. Qadeer. Interleaving and lock-step semantics for analysis and verification of GPU kernels. In phESOP, pages 270--289, 2013.
[11]
P. Cousot and R. Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In phPOPL, pages 238--252, 1977.
[12]
A. Danalis et al. The scalable heterogeneous computing (SHOC) benchmark suite. In phGPGPU 2010, pages 63--74, 2010.
[13]
io\uglu et al.(1990)E\ugecio\uglu, Gallopoulos, and Koç}PolynomialsO. E\ugecio\uglu, E. Gallopoulos, and C. Koç. A parallel method for fast and practical high-order Newton interpolation. phBIT Numerical Mathematics, 30: 268--288, 1990.
[14]
W. Gropp, E. Lusk, and A. Skjellum. phUsing MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 2nd edition, 1999.
[15]
A. Habermaier and A. Knapp. On the correctness of the SIMT execution model of GPUs. In phESOP, pages 316--335, 2012.
[16]
M. Harris, S. Sengupta, and J. Owens. Parallel prefix sum (scan) with CUDA. In H. Nguyen, editor, phGPU Gems 3. Addison-Wesley, 2007.
[17]
R. Hinze. An algebra of scans. In phMPC, pages 186--210, 2004.
[18]
'c(2013)}BYTECODEM. Huisman and M. Mihel\vcić. Specification and verification of GPGPU programs using permission-based separation logic. In phBYTECODE, 2013.
[19]
012)}OpenCLKhronos OpenCL Working Group. The OpenCL specification, version 1.2, 2012.
[20]
D. E. Knuth. phThe Art of Computer Programming, volume 3. Addison-Wesley, 2nd edition, 1998.
[21]
P. M. Kogge and H. S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. phIEEE Trans. Computers, C-22 (8): 786--793, 1973.
[22]
R. E. Ladner and M. J. Fischer. Parallel prefix computation. phJ. ACM, 27 (4): 831--838, 1980.
[23]
A. Leung, M. Gupta, Y. Agarwal, et al. Verifying GPU kernels by test amplification. In phPLDI, pages 383--394, 2012.
[24]
G. Li and G. Gopalakrishnan. Scalable SMT-based verification of GPU kernel functions. In phFSE, pages 187--196, 2010.
[25]
G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan. GKLEE: concolic verification and test generation for GPUs. In phPPoPP, pages 215--224, 2012.
[26]
D. Merrill and A. Grimshaw. Parallel scan for stream architectures. Technical Report CX2009--14, Department of Computer Science, University of Virginia, 2009.
[27]
012)}CUDANVIDIA. CUDA C programming guide, version 5.0, 2012.
[28]
B. C. Pierce. phTypes and Programming Languages. MIT Press, 2002.
[29]
(2006)}DBLP:conf/pvm/SandersT06P. Sanders and J. L. Tr\"aff. Parallel prefix (scan) algorithms for MPI. In phPVM/MPI, pages 22--29, 2006.
[30]
N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In phIPDPS, pages 1--10, 2009.
[31]
S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In phGH, pages 97--106, 2007.
[32]
I. Sergeev. On the complexity of parallel prefix circuits. Technical Report TR13-041, Electronic Colloquium on Computational Complexity, 2013.
[33]
M. Sheeran. Functional and dynamic programming in the design of parallel prefix networks. phJ. of Funct. Program., 21 (1): 59--114, 2011.
[34]
J. Sklansky. Conditional-sum addition logic. phIRE Trans. Electronic Computers, EC-9: 226--231, 1960.
[35]
B. So, A. M. Ghuloum, and Y. Wu. Optimizing data parallel operations on many-core platforms. In phIn First Workshop on Software Tools for Multi-Core Systems (STMCS), 2006.
[36]
H. S. Stone. Parallel processing with the perfect shuffle. phIEEE Trans. Comput., 20 (2): 153--161, 1971.
[37]
J. Stratton et al. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.
[38]
er(2008)}Voigtlaender08J. Voigtl\"ander. Much ado about two (pearl): a pearl on parallel prefix computation. In phPOPL, pages 29--35, 2008.

Cited By

View all

Index Terms

  1. A sound and complete abstraction for reasoning about parallel prefix sums

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 49, Issue 1
      POPL '14
      January 2014
      661 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/2578855
      Issue’s Table of Contents
      • cover image ACM Conferences
        POPL '14: Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
        January 2014
        702 pages
        ISBN:9781450325448
        DOI:10.1145/2535838
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 January 2014
      Published in SIGPLAN Volume 49, Issue 1

      Check for updates

      Author Tags

      1. abstraction
      2. formal verification
      3. gpus
      4. parallel prefix sum computation

      Qualifiers

      • Research-article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)20
      • Downloads (Last 6 weeks)6
      Reflects downloads up to 25 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media