DOI: 10.1145/3324884.3415291
short-paper

Sosed: a tool for finding similar software projects

Published: 27 January 2021

Abstract

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of subtokens into a dense space for 120,000 GitHub projects in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar subtokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their subtokens. The tool receives an arbitrary project as input, extracts subtokens in the 16 most popular programming languages, computes their cluster distribution, and finds the projects with the closest distributions in the search base. We labeled the subtoken clusters with short descriptions so that Sosed produces interpretable output.
Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of subtokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
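
The pipeline described in the abstract (split identifiers into subtokens, map subtokens to clusters, compare cluster distributions) can be illustrated with a minimal sketch. This is not the Sosed implementation: the function names, the toy subtoken-to-cluster mapping, the three-project search base, and the use of cosine similarity as the closeness measure are all assumptions made for illustration. In the real tool the clusters come from fastText embeddings and the search base holds millions of projects.

```python
import math
import re
from collections import Counter


def split_subtokens(identifier):
    """Split an identifier (camelCase, snake_case, acronyms) into lowercase subtokens."""
    subtokens = []
    for part in re.split(r"[^A-Za-z]+", identifier):
        # Acronym runs (e.g. "HTTP") or capitalized/lowercase words (e.g. "Server").
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    return [s.lower() for s in subtokens]


def cluster_distribution(identifiers, subtoken_to_cluster, n_clusters):
    """Return the normalized share of each cluster among a project's subtokens."""
    counts = Counter()
    for identifier in identifiers:
        for subtoken in split_subtokens(identifier):
            cluster = subtoken_to_cluster.get(subtoken)
            if cluster is not None:  # subtokens outside the vocabulary are skipped
                counts[cluster] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts[c] / total for c in range(n_clusters)]


def cosine(p, q):
    """Cosine similarity between two vectors (assumed closeness measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(query, search_base, top_k=3):
    """Rank projects in the search base by similarity to the query distribution."""
    ranked = sorted(search_base.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical toy data: 3 clusters (0 = networking, 1 = math, 2 = file I/O).
SUBTOKEN_TO_CLUSTER = {"http": 0, "request": 0, "server": 0,
                       "matrix": 1, "vector": 1,
                       "file": 2, "read": 2}
SEARCH_BASE = {"web-framework": [0.8, 0.0, 0.2],
               "linear-algebra": [0.1, 0.9, 0.0],
               "file-utils": [0.1, 0.0, 0.9]}

query = cluster_distribution(["parseHttpRequest", "HTTPServer", "read_file"],
                             SUBTOKEN_TO_CLUSTER, n_clusters=3)
print(most_similar(query, SEARCH_BASE, top_k=2))
```

Because each project is reduced to a fixed-length vector over clusters, the comparison does not depend on the programming language of the input, and the labeled clusters with the largest shares can be shown to the user as an explanation of why two projects match.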

Cited By

  • (2023) EGAD: A moldable tool for GitHub Action analysis. 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 260-264. DOI: 10.1109/MSR59073.2023.00044. Online publication date: May 2023.
  • (2022) CroLSSim: Cross‐language software similarity detector using hybrid approach of LSA‐based AST‐MDrep features and CNN‐LSTM model. International Journal of Intelligent Systems 37:9, 5768-5795. DOI: 10.1002/int.22813. Online publication date: 9 January 2022.

Published In

ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering
December 2020
1449 pages
ISBN:9781450367684
DOI:10.1145/3324884
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Short-paper

Conference

ASE '20

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%
