Using web corpus statistics for program analysis

Authors: Chun-Hung Hsiao, Michael Cafarella, Satish Narayanasamy

OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications

Pages 49 - 65

https://doi.org/10.1145/2660193.2660226

Published: 15 October 2014

Abstract

Several program analysis tools, such as plagiarism detectors and bug finders, rely on knowing a piece of code's relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common a line of code is. Natural language processing researchers have encountered a similar problem and addressed it using an n-gram model of text frequency, derived from statistics computed over text corpora.

We propose an n-gram model for programming languages and compute it over a corpus of 2.8 million JavaScript programs that we downloaded from the Web. In contrast to previous techniques, we define a code n-gram as the subgraph of the program dependence graph that contains all nodes and edges reachable in n steps from a given statement. We count the n-grams in a program and the frequency of each n-gram in the corpus, which lets us compute tf-idf-style measures that capture the differing importance of different lines of code. We demonstrate the power of this approach by implementing a plagiarism detector whose accuracy beats previous techniques, and a bug-finding tool that discovered over a dozen previously unknown bugs in a collection of real deployed programs.
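
The idea of a graph-based code n-gram with tf-idf-style weighting can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `code_ngram` and `tfidf_scores`, the dictionary-based graph encoding, the use of sorted edge labels as a stand-in for a canonical subgraph form, the choice to follow only outgoing dependence edges, and the smoothed idf formula are all assumptions made for illustration.

```python
import math
from collections import Counter, deque


def code_ngram(graph, labels, start, n):
    """Key for the subgraph reachable within n steps of `start`.

    graph:  dict mapping a statement id to the ids it has edges to
            (assumed encoding of a program dependence graph).
    labels: dict mapping a statement id to a normalized statement label.
    The key is a simplification: sorted label pairs, not a true
    canonical form, so distinct subgraphs could collide.
    """
    depth = {start: 0}
    queue = deque([start])
    edges = []
    while queue:
        node = queue.popleft()
        if depth[node] == n:
            continue  # edges leaving this node would need more than n steps
        for succ in graph.get(node, ()):
            edges.append((labels[node], labels[succ]))
            if succ not in depth:
                depth[succ] = depth[node] + 1
                queue.append(succ)
    return labels[start], tuple(sorted(edges))


def tfidf_scores(program_ngrams, corpus):
    """Weight each n-gram of one program by a tf-idf-style score.

    program_ngrams: list of n-gram keys found in the program.
    corpus:         list of programs, each a list of n-gram keys.
    Uses a smoothed idf, log((1 + N) / (1 + df)), chosen here for
    illustration; n-grams present in every program score zero.
    """
    tf = Counter(program_ngrams)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each n-gram once per program
    total = len(corpus)
    return {
        gram: count * math.log((1 + total) / (1 + df[gram]))
        for gram, count in tf.items()
    }
```

Under these assumptions, an n-gram that occurs in nearly every program in the corpus (such as a plain loop counter test) receives a weight near zero, while a rare n-gram receives a higher weight, which matches the intuition described in the abstract.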

Index Terms

Using web corpus statistics for program analysis
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Program analysis
      2. Program verification
    2. Program semantics

Published In

OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications

October 2014

946 pages

ISBN:9781450325851

DOI:10.1145/2660193

General Chair: Andrew Black, Portland State University, USA
Program Chair: Todd Millstein, University of California, Los Angeles, USA

ACM SIGPLAN Notices Volume 49, Issue 10
OOPSLA '14
October 2014
907 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2714064
Editor: Andy Gill, University of Kansas, Lawrence, KS

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

In-Cooperation

SIGAda: ACM Special Interest Group on Ada Programming Language

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SPLASH '14

Sponsor:

SIGPLAN

SPLASH '14: Conference on Systems, Programming, and Applications: Software for Humanity

October 20 - 24, 2014

Portland, Oregon, USA

Acceptance Rates

OOPSLA '14 Paper Acceptance Rate 52 of 186 submissions, 28%;

Overall Acceptance Rate 268 of 1,244 submissions, 22%
