A language-agnostic model for semantic source code labeling

Authors:

David Slater

MASES 2018: Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis

Pages 36 - 44

https://doi.org/10.1145/3243127.3243132

Published: 03 September 2018

Abstract

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
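
The two figures quoted in the abstract (mean area under ROC across tags, and top-1 accuracy) can be illustrated with a small sketch. This is not the paper's evaluation code: the function names, the toy data, and the choice of an unweighted per-tag average are assumptions made for illustration, and the paper's top-1 figure came from manual validation rather than from ground-truth labels as used here.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for one tag: the probability that a randomly chosen positive
    document scores higher than a randomly chosen negative one (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positives and negatives")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Unweighted mean of per-tag AUCs (assumed averaging scheme).
    Rows are documents, columns are tags."""
    n_tags = len(y_true[0])
    per_tag = [
        roc_auc([row[t] for row in y_true], [row[t] for row in y_score])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags


def top1_accuracy(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Fraction of documents whose highest-scoring tag is one of their true tags."""
    hits = 0
    for truth, scores in zip(y_true, y_score):
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += truth[best]
    return hits / len(y_true)


# Hypothetical toy data: 4 documents, 3 tags (multi-label, so rows may have several 1s).
Y_TRUE = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_SCORE = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.35, 0.7, 0.2], [0.4, 0.1, 0.35]]

if __name__ == "__main__":
    print("mean AUC:", mean_auc(Y_TRUE, Y_SCORE))
    print("top-1 accuracy:", top1_accuracy(Y_TRUE, Y_SCORE))
```

Averaging per tag without weighting gives rare tags the same influence as common ones, which matters for a long-tailed tag list; a frequency-weighted average would be a different, equally plausible choice.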

Cited By

Kong X, Lv Z, Chen C, Chang H, Li N, Zhang F. (2024). Code Recommendation for Schema Evolution of Mimic Storage Systems. International Journal of Software Engineering and Knowledge Engineering, 1-22. Online publication date: 28-Oct-2024.
https://doi.org/10.1142/S0218194024500499
Tsai M, Lin C, He Z, Yang W, Lei C. (2023). PowerDP: De-Obfuscating and Profiling Malicious PowerShell Commands With Multi-Label Classifiers. IEEE Access 11, 256-270. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2022.3232505
Shafiq S, Mashkoor A, Mayr-Dorn C, Egyed A. (2021). A Literature Review of Using Machine Learning in Software Development Life Cycle Stages. IEEE Access 9, 140896-140920. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3119746

Index Terms

A language-agnostic model for semantic source code labeling
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published In

MASES 2018: Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis

September 2018

52 pages

ISBN: 9781450359726

DOI: 10.1145/3243127

Copyright © 2018 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence
CNRS: Centre National de la Recherche Scientifique
SIGSOFT: ACM Special Interest Group on Software Engineering
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASE '18

ASE '18: 33rd ACM/IEEE International Conference on Automated Software Engineering

September 3, 2018

Montpellier, France
