DOI: 10.1145/3180155.3180252

Collective program analysis

Published: 27 May 2018

Abstract

The popularity of data-driven software engineering has led to increasing demand for infrastructures that support efficient execution of tasks requiring deeper source code analysis. While task optimization and parallelization are the commonly adopted solutions, other research directions are less explored. We present collective program analysis (CPA), a technique for scaling large-scale source code analyses, especially those that make use of control and data flow analysis, by leveraging analysis-specific similarity. Analysis-specific similarity concerns whether two or more programs can be considered similar for a given analysis. The key idea of collective program analysis is to cluster programs based on analysis-specific similarity, such that running the analysis on one candidate in each cluster is sufficient to produce the result for the others. To determine analysis-specific similarity and cluster analysis-equivalent programs, we use a sparse representation and a canonical labeling scheme. Our evaluation shows that, for a variety of source code analyses on a large dataset of programs, a substantial reduction in analysis time can be achieved: on average a 69% reduction when compared to a baseline and on average a 36% reduction when compared to a prior technique. We also found that a large number of analysis-equivalent programs exist in large datasets.
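The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the graph encoding, the function names `canonical_key` and `collective_analysis`, and the toy analysis are all assumptions made for illustration. Each program is assumed to be already reduced to a small labeled graph containing only analysis-relevant nodes, with node 0 as the entry and an ordered successor list per node. The canonical key here is a simple depth-first renumbering, which stands in for whatever canonical labeling scheme the paper uses.

```python
from collections import defaultdict

def canonical_key(labels, succs):
    """Hypothetical canonical form: renumber nodes in DFS discovery order
    from entry node 0, following successors in their stored order."""
    order = {}
    stack = [0]
    while stack:
        node = stack.pop()
        if node in order:
            continue
        order[node] = len(order)
        # Push in reverse so the first successor is discovered first.
        for nxt in reversed(succs[node]):
            if nxt not in order:
                stack.append(nxt)
    nodes = sorted(order, key=order.get)
    new_labels = tuple(labels[n] for n in nodes)
    new_succs = tuple(tuple(order[s] for s in succs[n]) for n in nodes)
    return (new_labels, new_succs)

def collective_analysis(programs, analyze):
    """Group programs by canonical key, run `analyze` once per group,
    and copy the result to every member of the group."""
    clusters = defaultdict(list)
    for name, (labels, succs) in programs.items():
        clusters[canonical_key(labels, succs)].append(name)
    results, runs = {}, 0
    for members in clusters.values():
        labels, succs = programs[members[0]]  # one candidate per cluster
        outcome = analyze(labels, succs)
        runs += 1
        for member in members:
            results[member] = outcome
    return results, runs

# Toy data (assumed): "a" and "b" are the same graph with different node
# numbering; "c" is a different graph.
programs = {
    "a": (["entry", "def", "use"], [[1], [2], []]),
    "b": (["entry", "use", "def"], [[2], [], [1]]),
    "c": (["entry", "use"], [[1], []]),
}

def count_defs(labels, succs):
    # Stand-in analysis: counts nodes labeled "def".
    return sum(1 for label in labels if label == "def")

results, runs = collective_analysis(programs, count_defs)
print(results, runs)
```

In this toy run the analysis executes twice for three programs, because "a" and "b" fall into the same cluster. The sketch is only sound if the analysis result depends solely on the information captured in the key, which is the property the abstract calls analysis-specific similarity.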

Information

Published In

ICSE '18: Proceedings of the 40th International Conference on Software Engineering
May 2018
1307 pages
ISBN: 9781450356381
DOI: 10.1145/3180155
  • Conference Chair: Michel Chaudron
  • General Chair: Ivica Crnkovic
  • Program Chairs: Marsha Chechik, Mark Harman
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Boa
  2. clustering
  3. source code analysis

Qualifiers

  • Research-article

Conference

ICSE '18

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%
