research-article
Open access

Analysis of dependence tracking algorithms for task dataflow execution

Published: 01 December 2013

Abstract

Processor architecture has taken a turn toward many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core processors are harder to program than multicore and single-core processors because they require parallel or concurrent programs with a high degree of parallelism. Moreover, many-cores have to operate in a mode of strong scaling because of memory bandwidth constraints. In strong scaling, increasingly finer-grain parallelism must be extracted in order to keep all processing cores busy.
Task dataflow programming models have a high potential to simplify parallel programming because they relieve the programmer of precisely identifying all intertask dependences when writing programs. Instead, the task dataflow runtime system detects and enforces intertask dependences during execution based on the description of the memory accessed by each task. The runtime constructs a task dataflow graph that captures all tasks and their dependences. Tasks are scheduled to execute in parallel, taking into account the dependences specified in the task graph.
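To make the idea of runtime dependence detection concrete, the following is a minimal sketch of how a task graph could be derived from per-argument access annotations. It is not the implementation studied in the article: the class name `TaskGraph`, the method `add_task`, and the string modes `"in"`, `"out"`, and `"inout"` are hypothetical, and the sketch conservatively records read-after-write, write-after-write, and write-after-read dependences without any renaming.

```python
from collections import defaultdict


class TaskGraph:
    """Toy dependence tracker (hypothetical names, illustration only).

    Tasks are registered in program order. Edges are derived from the
    access mode each task declares on each named object.
    """

    def __init__(self):
        self.edges = {}                    # task -> set of tasks it waits for
        self.last_writer = {}              # object -> most recent writing task
        self.readers = defaultdict(list)   # object -> readers since last write

    def add_task(self, task, accesses):
        """Register `task`; `accesses` is a list of (object, mode) pairs."""
        accesses = list(accesses)
        deps = set()
        for obj, mode in accesses:
            if mode not in ("in", "out", "inout"):
                raise ValueError(f"unknown access mode: {mode}")
            writer = self.last_writer.get(obj)
            if writer is not None:
                # Readers need the last value (read-after-write); writers
                # are ordered after the previous writer (write-after-write).
                deps.add(writer)
            if mode in ("out", "inout"):
                # A writer must also wait for earlier readers of the
                # old value (write-after-read).
                deps.update(self.readers[obj])
        # Update the per-object state only after all dependences are known.
        for obj, mode in accesses:
            if mode in ("out", "inout"):
                self.last_writer[obj] = task
                self.readers[obj] = []
            else:
                self.readers[obj].append(task)
        deps.discard(task)
        self.edges[task] = deps
        return deps


if __name__ == "__main__":
    g = TaskGraph()
    g.add_task("produce", [("a", "out")])
    g.add_task("read1", [("a", "in")])
    g.add_task("read2", [("a", "in")])
    print(sorted(g.add_task("update", [("a", "inout")])))
```

In this sketch the two readers depend only on `produce` and may run in parallel, while `update` must wait for the producer and both readers.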
Several papers report significant overheads for task dataflow systems, which severely limit the scalability and usability of such systems. In this article, we study efficient schemes to manage task graphs and analyze their scalability. We assume a programming model that supports input, output, and in/out annotations on task arguments, as well as commutative in/out and reductions. We analyze the structure of task graphs and identify versions and generations as key concepts for efficient management of task graphs. Then, we present three schemes to manage task graphs, building on graph representations, hypergraphs, and lists. We also consider a fourth, edgeless scheme that synchronizes tasks using integers. Analysis using microbenchmarks shows that the graph representation is not always scalable and that the edgeless scheme introduces the least overhead in nearly all situations.
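The abstract does not spell out how the edgeless scheme works. As one possible reading of "synchronizes tasks using integers", the sketch below uses ticket-style counters per object instead of explicit edges. All names (`ObjectCounters`, `Task`, `issue`, `is_ready`, `complete`) are hypothetical, commutative in/out and reductions are left out, and the code is single-threaded; a real runtime would need atomic counter updates.

```python
class ObjectCounters:
    """Per-object integers that stand in for dependence edges (assumed layout)."""

    def __init__(self):
        self.readers_issued = 0
        self.readers_done = 0
        self.writers_issued = 0
        self.writers_done = 0


class Task:
    def __init__(self, name, accesses):
        self.name = name
        self.accesses = accesses   # list of (ObjectCounters, mode)
        self.tickets = []          # filled in by issue()


def issue(task):
    """Take tickets in program order; no edges are stored anywhere."""
    for obj, mode in task.accesses:
        if mode == "in":
            # A reader only waits for the writers issued before it.
            task.tickets.append((obj, obj.writers_issued, None))
            obj.readers_issued += 1
        else:
            # "out" and "inout" wait for earlier writers and earlier readers.
            task.tickets.append((obj, obj.writers_issued, obj.readers_issued))
            obj.writers_issued += 1


def is_ready(task):
    """A task may run once the 'done' counters have caught up with its tickets."""
    for obj, writers_before, readers_before in task.tickets:
        if obj.writers_done != writers_before:
            return False
        if readers_before is not None and obj.readers_done != readers_before:
            return False
    return True


def complete(task):
    """Advance the counters so that waiting tasks become ready."""
    for obj, mode in task.accesses:
        if mode == "in":
            obj.readers_done += 1
        else:
            obj.writers_done += 1


if __name__ == "__main__":
    a = ObjectCounters()
    t1 = Task("produce", [(a, "out")])
    t2 = Task("read", [(a, "in")])
    issue(t1)
    issue(t2)
    print(is_ready(t1), is_ready(t2))
    complete(t1)
    print(is_ready(t2))
```

The appeal of a design like this is that issuing and completing a task costs a fixed number of integer updates per argument, regardless of how many other tasks touch the same object.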



Published In

ACM Transactions on Architecture and Code Optimization  Volume 10, Issue 4
December 2013
1046 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2541228

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2013
Accepted: 01 November 2013
Revised: 01 November 2013
Received: 01 June 2013
Published in TACO Volume 10, Issue 4


Author Tags

  1. Task dataflow
  2. scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed
