Parallel program performance prediction using deterministic task graph analysis

Published: 01 February 2004

Abstract

In this article, we consider analytical techniques for predicting detailed performance characteristics of a single shared-memory parallel program for a particular input. Analytical models for parallel programs have been successful at providing simple qualitative insights and bounds on program scalability, but have been less successful in practice at providing detailed insights and metrics for program performance (leaving these to measurement or simulation). We develop a conceptually simple modeling technique called deterministic task graph analysis that provides detailed performance prediction for shared-memory programs with arbitrary task graphs, a wide variety of task scheduling policies, and significant communication and resource contention. Unlike many previous models, which are stochastic, our model assumes deterministic task execution times (while retaining the use of stochastic models for communication and resource contention). This assumption is supported by a previous study of the influence of nondeterministic delays in parallel programs.

We evaluate our model in three ways. First, an experimental evaluation shows that our analysis technique is accurate and efficient for a variety of shared-memory programs, including programs with large and/or complex task graphs, sophisticated task scheduling, highly nonuniform task times, and significant communication and resource contention. The results also show that the deterministic assumption is crucial to permit accurate and yet efficient analysis of these programs. Second, we use three example programs to illustrate the predictive capabilities of the model. In two cases, broad insights and detailed metrics from the model are used to suggest improvements in load balancing, and the model quickly and accurately predicts the impact of these changes. In the third case, the model provides novel insights into the impact of program design changes that improve communication locality as well as load balancing, via new (but general-purpose) metrics. Finally, we present results from a comparison of our model and representative stochastic models, and use these to characterize the conditions under which a deterministic model or stochastic models would be appropriate.
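
As a rough illustration of what evaluating a task graph with deterministic task times can look like, the Python sketch below predicts the completion time of a small fork-join graph on a fixed number of identical processors. It is not the authors' algorithm: the function name `predict_makespan`, the example graph, and the FIFO dispatch policy are all assumptions made for illustration, and the communication and resource contention submodel described in the abstract is omitted entirely.

```python
import heapq
from collections import defaultdict


def predict_makespan(durations, deps, num_procs):
    """Predict when each task finishes, given fixed (deterministic) task times.

    durations: dict mapping task name -> execution time
    deps:      dict mapping task name -> list of predecessor task names
    num_procs: number of identical processors (must be >= 1)

    Assumed policy (hypothetical): ready tasks are dispatched in FIFO order
    to any free processor; no communication or contention delay is modeled.
    Returns (makespan, finish_times).
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")

    # Count unfinished predecessors and build the successor lists.
    remaining = {t: len(deps.get(t, [])) for t in durations}
    successors = defaultdict(list)
    for task, preds in deps.items():
        for pred in preds:
            successors[pred].append(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)  # FIFO queue
    running = []  # min-heap of (finish_time, task)
    finish = {}
    now = 0.0

    while ready or running:
        # Start ready tasks while processors are free.
        while ready and len(running) < num_procs:
            task = ready.pop(0)
            heapq.heappush(running, (now + durations[task], task))
        # Jump directly to the next task completion.
        now, done = heapq.heappop(running)
        finish[done] = now
        for succ in successors[done]:
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)

    if len(finish) != len(durations):
        raise ValueError("task graph contains a cycle")
    return now, finish


# Hypothetical fork-join program with nonuniform worker times.
durations = {"fork": 1.0, "w0": 4.0, "w1": 2.0, "w2": 2.0, "w3": 1.0, "join": 1.0}
deps = {
    "w0": ["fork"], "w1": ["fork"], "w2": ["fork"], "w3": ["fork"],
    "join": ["w0", "w1", "w2", "w3"],
}

if __name__ == "__main__":
    for p in (1, 2, 4):
        makespan, _ = predict_makespan(durations, deps, p)
        print(f"{p} processor(s): predicted completion time {makespan}")
```

Because every task time is a fixed number, the evaluation only has to step from one task completion to the next, so its cost grows with the size of the graph and not with the number of possible execution-time outcomes that a fully stochastic model would have to consider.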

Reviews

Gregory D Peterson

Analytic models for predicting the performance of parallel applications can provide invaluable insight to evaluate efficiency and identify bottlenecks. Adve and Vernon have developed a remarkably accurate performance model for parallel applications executing on dedicated, shared-memory systems. The model has two levels: a lower-level queuing model to characterize the impact of contention and caching effects, and a higher-level task graph model of the application. The task graph is predefined for a specific set of inputs, and each task is assumed to have a deterministic run time. The model is validated with five applications with different task graph structures, in order to illustrate the approach's accuracy and flexibility, particularly in comparison to previous work with stochastic models. The authors do an excellent job of developing the model, validating it, and interpreting the results. Perhaps because the paper took more than a decade to be published, the authors' claims that stochastic models have not been validated with real applications, and claims about restrictions on their form, are not necessarily accurate (see Atallah [1], Chamberlain and Franklin [2], Noble et al. [3], and Peterson and Chamberlain [4]). Nonetheless, the paper makes an important contribution to our understanding of application performance on dedicated, shared-memory platforms.

Online Computing Reviews Service

Published In

ACM Transactions on Computer Systems, Volume 22, Issue 1
February 2004
136 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/966785

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2004
Published in TOCS Volume 22, Issue 1

Author Tags

  1. Analytical model
  2. deterministic model
  3. parallel program performance prediction
  4. queueing network
  5. shared memory
  6. task graph
  7. task scheduling

Qualifiers

  • Article
