short-paper

Sosed: a tool for finding similar software projects

Authors:

Egor Bogomolov,

Yaroslav Golubev,

Artyom Lobanov,

Vladimir Kovalenko,

Timofey Bryksin

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering

Pages 1316 - 1320

https://doi.org/10.1145/3324884.3415291

Published: 27 January 2021

Abstract

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts its subtokens in the 16 most popular programming languages, computes the project's cluster distribution, and finds the projects with the closest distributions in the search base. We labeled subtoken clusters with short descriptions to enable Sosed to produce interpretable output.
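The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not Sosed's implementation: the subtoken-to-cluster table is a hypothetical toy stand-in for the clusters obtained from fastText embeddings, the identifier-splitting regex is an assumption, and KL divergence is only one plausible way to compare distributions, since the abstract does not name the measure.

```python
import math
import re
from collections import Counter

# Hypothetical toy mapping from subtoken to cluster id (assumption: the real
# clusters come from clustering fastText embeddings, not reproduced here).
SUBTOKEN_TO_CLUSTER = {
    "http": 0, "request": 0, "server": 0,
    "matrix": 1, "vector": 1, "tensor": 1,
    "button": 2, "window": 2, "click": 2,
}
NUM_CLUSTERS = 3


def split_identifier(identifier):
    """Split an identifier into lowercase subtokens (snake_case and camelCase)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def cluster_distribution(identifiers):
    """Return the normalized histogram of cluster ids over a project's subtokens."""
    counts = Counter()
    for identifier in identifiers:
        for subtoken in split_identifier(identifier):
            cluster = SUBTOKEN_TO_CLUSTER.get(subtoken)
            if cluster is not None:  # subtokens outside the vocabulary are skipped
                counts[cluster] += 1
    total = sum(counts.values())
    if total == 0:
        # No known subtokens: fall back to a uniform distribution.
        return [1.0 / NUM_CLUSTERS] * NUM_CLUSTERS
    return [counts[c] / total for c in range(NUM_CLUSTERS)]


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with additive smoothing so empty clusters do not divide by zero."""
    p_smooth = [x + eps for x in p]
    q_smooth = [x + eps for x in q]
    p_total, q_total = sum(p_smooth), sum(q_smooth)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p_smooth, q_smooth)
    )


def most_similar(query, search_base, top_k=1):
    """Rank projects in the search base by divergence from the query distribution."""
    ranked = sorted(search_base.items(), key=lambda item: kl_divergence(query, item[1]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical search base of three tiny "projects", each given as identifiers.
search_base = {
    "web-framework": cluster_distribution(["HTTPServer", "handle_request"]),
    "linear-algebra": cluster_distribution(["matrixVector", "Tensor"]),
    "gui-toolkit": cluster_distribution(["onButtonClick", "MainWindow"]),
}
query = cluster_distribution(["sendHttpRequest", "server_window"])
print(most_similar(query, search_base))
```

Here the query project mostly uses networking subtokens, so the toy web framework ranks first. At the scale of the search base mentioned in the abstract, a linear scan like this would likely be replaced by an approximate nearest-neighbor index.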

Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.

Cited By

Valenzuela-Toledo P, Bergel A, Kehrer T, Nierstrasz O (2023). EGAD: A moldable tool for GitHub Action analysis. 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 260-264. Online publication date: May-2023. https://doi.org/10.1109/MSR59073.2023.00044

Ullah F, Naeem M, Naeem H, Cheng X, Alazab M (2022). CroLSSim: Cross‐language software similarity detector using hybrid approach of LSA‐based AST‐MDrep features and CNN‐LSTM model. International Journal of Intelligent Systems 37:9, 5768-5795. Online publication date: 9-Jan-2022. https://doi.org/10.1002/int.22813

Published In

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering

December 2020

1449 pages

ISBN:9781450367684

DOI:10.1145/3324884

General Chair:
John Grundy,
Program Chairs:
Claire Le Goues,
David Lo

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Conference

ASE '20

ASE '20: 35th IEEE/ACM International Conference on Automated Software Engineering

December 21 - 25, 2020

Virtual Event, Australia

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%
