DOI: 10.1145/3243127.3243132

A language-agnostic model for semantic source code labeling

Published: 03 September 2018

Abstract

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while keeping their representations of new programming languages, libraries, and functionalities up to date. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under the ROC curve of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
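The two figures quoted in the abstract (mean area under the ROC curve across tags, and top-1 accuracy) can be illustrated with a minimal sketch. It assumes the mean is an unweighted average of per-tag AUCs and that a top-1 prediction counts as correct when the highest-scoring tag is among a document's true labels; the abstract does not spell out either definition, and the function names and toy data below are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for one tag: chance that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positive and negative examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(label_matrix, score_matrix) -> float:
    """Unweighted mean of per-tag AUCs (assumed averaging); rows are snippets, columns are tags."""
    n_tags = len(label_matrix[0])
    per_tag = [
        roc_auc([row[t] for row in label_matrix], [row[t] for row in score_matrix])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags


def top1_accuracy(label_matrix, score_matrix) -> float:
    """Fraction of documents whose highest-scoring tag is one of their true labels."""
    hits = 0
    for labels, scores in zip(label_matrix, score_matrix):
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += labels[best]
    return hits / len(label_matrix)


# Hypothetical toy data: 4 snippets (rows) and 3 tags (columns).
y_true = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
y_score = [
    [0.9, 0.2, 0.10],
    [0.3, 0.8, 0.20],
    [0.6, 0.7, 0.10],
    [0.4, 0.1, 0.15],
]

print(round(mean_auc(y_true, y_score), 4))  # 0.8889
print(top1_accuracy(y_true, y_score))       # 0.75
```

The pairwise comparison is quadratic in the number of examples per tag, which is fine for a toy illustration; an evaluation over thousands of tags would more plausibly use a rank-based routine from a numerical library.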

MASES ’18, September 3, 2018, Montpellier, France

Published In

MASES 2018: Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis
September 2018
52 pages
ISBN: 9781450359726
DOI: 10.1145/3243127
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. crowdsourcing
  2. deep learning
  3. multilabel classification
  4. natural language processing
  5. semantic labeling
  6. source code

Qualifiers

  • Research-article

Conference

ASE '18