research-article

DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN

Authors:

Erich Schubert,

Hans Peter Kriegel,

Xiaowei XuAuthors Info & Claims

ACM Transactions on Database Systems (TODS), Volume 42, Issue 3

Article No.: 19, Pages 1 - 21

https://doi.org/10.1145/3068335

Published: 31 July 2017 Publication History

Abstract

At SIGMOD 2015, an article was presented with the title “DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation” that won the conference’s best paper award. In this technical correspondence, we want to point out some inaccuracies in the way DBSCAN was represented, and why the criticism should have been directed at the assumption about the performance of spatial index structures such as R-trees and not at an algorithm that can use such indexes. We will also discuss the relationship of DBSCAN performance and the indexability of the dataset, and discuss some heuristics for choosing appropriate DBSCAN parameters. Some indicators of bad parameters will be proposed to help guide future users of this algorithm in choosing parameters such as to obtain both meaningful results and good performance. In new experiments, we show that the new SIGMOD 2015 methods do not appear to offer practical benefits if the DBSCAN parameters are well chosen and thus they are primarily of theoretical interest. In conclusion, the original DBSCAN algorithm with effective indexes and reasonably chosen parameter values performs competitively compared to the method proposed by Gan and Tao.

References

[1]

Pankaj K. Agarwal, Herbert Edelsbrunner, Otfried Schwarzkopf, and Emo Welzl. 1991. Euclidean minimum spanning trees and bichromatic closest pairs. Discrete 8 Computational Geometry 6, 1 (1991), 407--422.

Digital Library

[2]

Gennady L. Andrienko, Natalia V. Andrienko, Christophe Hurter, Salvatore Rinzivillo, and Stefan Wrobel. 2011. From movement tracks through events to places: Extracting and characterizing significant places from mobility data. In Proceedings of the 2011 IEEE Symposium on Visual Analytics Science and Technology (VAST). 161--170.

[3]

Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 49--60.

Digital Library

[4]

Lars Arge, Mark de Berg, Herman J. Haverkort, and Ke Yi. 2004. The priority R-tree: A practically efficient and worst-case optimal R-tree. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 347--358.

Digital Library

[5]

Sunil Arya and David M. Mount. 1993. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the 4th Annual ACM/SIGACT-SIAM Symposium on Discrete Algorithms (SODA). 271--280.

[6]

Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 322--331.

Digital Library

[7]

Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Communications of the ACM 18, 9 (1975), 509--517.

Digital Library

[8]

Stefan Berchtold, Christian Böhm, Bernhard Braunmüller, Daniel A. Keim, and Hans-Peter Kriegel. 1997. Fast parallel similarity search in multimedia databases. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 1--12.

Digital Library

[9]

Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When is “nearest neighbor” meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT). 217--235.

[10]

Alina Beygelzimer, Sham Kakade, and John Langford. 2006. Cover trees for nearest neighbors. In Proceedings of the 23rd International Conference on Machine Learning (ICML). 97--104.

Digital Library

[11]

Ergun Biçici and Deniz Yuret. 2007. Locally scaled density based clustering. In Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA) 2007. 739--748.

Digital Library

[12]

Thomas Brinkhoff, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1994. Multi-step processing of spatial joins. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 197--208.

Digital Library

[13]

Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander. 2013. Density-based clustering based on hierarchical density estimates. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). 160--172.

[14]

Michel Marie Deza and Elena Deza. 2009. Encyclopedia of Distances (3rd ed.). Springer.

[15]

Jeff Erickson. 1995. On the relative complexities of some geometric problems. In Proceedings of the 7th Canadian Conference on Computational Geometry. 85--90.

[16]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). 226--231.

[17]

Christos Faloutsos and Ibrahim Kamel. 1994. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 4--13.

Digital Library

[18]

Junhao Gan and Yufei Tao. 2015. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the ACM International Conference on Management of Data (SIGMOD). 519--530.

Digital Library

[19]

Ade Gunawan. 2013. A faster algorithm for DBSCAN. Master’s thesis. Technical University of Eindhoven. http://repository.tue.nl/760643.

[20]

Mark A. Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations 11, 1 (2009), 10--18.

Digital Library

[21]

Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.

Digital Library

[22]

Joseph M. Hellerstein, Elias Koutsoupias, and Christos H. Papadimitriou. 1997. On the analysis of indexing schemes. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 249--256.

Digital Library

[23]

Kirsten Hildrum, John Kubiatowicz, Sean Ma, and Satish Rao. 2004. A note on the nearest neighbor in growth-restricted metrics. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 560--561.

[24]

Alexander Hinneburg and Daniel A. Keim. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th ACM International Conference on Knowledge Discovery and Data Mining (KDD). 58--65.

[25]

Michael E. Houle, Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. 2010. Can shared-neighbor distances defeat the curse of dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM). 482--500.

[26]

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC). 604--613.

Digital Library

[27]

David R. Karger and Matthias Ruhl. 2002. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC). 741--750.

Digital Library

[28]

Robert Krauthgamer and James R. Lee. 2004. Navigating nets: Simple algorithms for proximity search. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 798--807.

[29]

Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek. 2016. The (black) art of runtime evaluation: Are we comparing algorithms or implementations?Knowledge and Information Systems (KAIS) (Oct. 2016).

Digital Library

[30]

Moshe Lichman. 2013. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.

[31]

Shaaban Mahran and Khaled Mahar. 2008. Using grid for accelerating density-based clustering. In Proceedings of 8th IEEE International Conference on Computer and Information Technology (CIT’08). 35--40.

[32]

Son T. Mai, Ira Assent, and Martin Storgaard. 2016. AnyDBC: An efficient anytime density-based clustering algorithm for very large complex datasets. In Proceedings of the 22st ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). 1025--1034.

Digital Library

[33]

Yannis Manolopoulos, Alexandros Nanopoulos, Apostolos N. Papadopoulos, and Yannis Theodoridis. 2003. R-trees have Grown Everywhere. Technical Report. Retrieved from http://www.rtreeportal.org/.

[34]

Jirí Matousek. 1991. Reporting points in halfspaces. In Proceedings of the 32nd Annual IEEE Symposium on Foundations of Computer Science (FOCS). 207--215.

Digital Library

[35]

Jirí Matousek. 1994. Geometric range searching. Computing Surveys 26, 4 (1994), 421--461.

Digital Library

[36]

Apostolos Papadopoulos and Yannis Manolopoulos. 1997. Performance of nearest neighbor queries in R-trees. In Proceedings of the 6th International Conference on Database Theory (ICDT). 394--408.

[37]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12 (2011), 2825--2830.

Digital Library

[38]

Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In Proceedings of the 16th International Symposium on Wearable Computers (ISWC’12). 108--109.

Digital Library

[39]

Vasilis Samoladas and Daniel P. Miranker. 1998. A lower bound theorem for indexing schemes and its application to multidimensional range queries. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. 44--51.

Digital Library

[40]

Jörg Sander, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. 1998. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery 2, 2 (1998), 169--194.

Digital Library

[41]

Erich Schubert, Alexander Koos, Tobias Emrich, Andreas Züfle, Klaus Arthur Schmid, and Arthur Zimek. 2015. A framework for clustering uncertain data. Proceedings of the VLDB Endowment 8, 12 (2015), 1976--1979. https://elki-project.github.io/.

Digital Library

[42]

Erich Schubert, Arthur Zimek, and Hans-Peter Kriegel. 2013. Geodetic distance queries on r-trees for indexing geographic data. In Proceedings of the 13th International Symposium on Spatial and Temporal Databases (SSTD). 146--164.

Digital Library

[43]

Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2006. Introduction to Data Mining. Addison Wesley.

[44]

R Core Team. 2015. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Retrieved from http://www.r-project.org/.

[45]

Wei Wang, Jiong Yang, and Richard R. Muntz. 1997. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB). 186--195.

[46]

Roger Weber, Hans-Jörg Schek, and Stephen Blott. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB). 194--205.

[47]

Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. 2007. SCAN: A structural clustering algorithm for networks. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD). 824--833.

Digital Library

[48]

Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining 5, 5 (2012), 363--387.

Digital Library

Cited By

Kou YGao XZhang YLiu TAn GYe FTian YChen Y(2025)Hierarchical Optimization Segmentation and Parameter Extraction of Street Trees Based on Topology Checking and Boundary Analysis from LiDAR Point CloudsSensors10.3390/s2501018825:1(188)Online publication date: 1-Jan-2025
https://doi.org/10.3390/s25010188
Li JLv YZhou ZDu ZWei QXu K(2025)Identification and Correction of Abnormal, Incomplete Power Load Data in Electricity Spot Market DatabasesEnergies10.3390/en1801017618:1(176)Online publication date: 3-Jan-2025
https://doi.org/10.3390/en18010176
Wang JWu JQu YXiao QGao QFang Y(2025)Multi-Target Device-Free Positioning Based on Spatial-Temporal mmWave Point CloudIEEE Transactions on Mobile Computing10.1109/TMC.2024.347467124:2(1163-1180)Online publication date: Feb-2025
https://doi.org/10.1109/TMC.2024.3474671
Show More Cited By

Index Terms

DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN

Recommendations

AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities

Clustering is a typical data mining technique that partitions a dataset into multiple subsets of similar objects according to similarity metrics. In particular, density-based algorithms can find clusters of different shapes and sizes while remaining ...
A new hybrid method based on partitioning-based DBSCAN and ant clustering

Clustering problem is an unsupervised learning problem. It is a procedure that partition data objects into matching clusters. The data objects in the same cluster are quite similar to each other and dissimilar in the other clusters. Density-based ...
Exact, Fast and Scalable Parallel DBSCAN for Commodity Platforms
ICDCN '17: Proceedings of the 18th International Conference on Distributed Computing and Networking

DBSCAN is one of the most popular density-based clustering algorithm capable of identifying arbitrary shaped clusters and noise. It is computationally expensive for large data sets. In this paper, we present a grid-based DBSCAN algorithm, GridDBSCAN, ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems

ACM Transactions on Database Systems Volume 42, Issue 3

Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence

September 2017

220 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/3129336

Editor:
Christian S. Jensen
Aalborg University, Denmark

Issue’s Table of Contents

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2017

Accepted: 01 March 2017

Revised: 01 March 2017

Received: 01 November 2015

Published in TODS Volume 42, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1,647
Total Citations
View Citations
15,684
Total Downloads

Downloads (Last 12 months)2,574
Downloads (Last 6 weeks)362

Reflects downloads up to 09 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Kou YGao XZhang YLiu TAn GYe FTian YChen Y(2025)Hierarchical Optimization Segmentation and Parameter Extraction of Street Trees Based on Topology Checking and Boundary Analysis from LiDAR Point CloudsSensors10.3390/s2501018825:1(188)Online publication date: 1-Jan-2025
https://doi.org/10.3390/s25010188
Li JLv YZhou ZDu ZWei QXu K(2025)Identification and Correction of Abnormal, Incomplete Power Load Data in Electricity Spot Market DatabasesEnergies10.3390/en1801017618:1(176)Online publication date: 3-Jan-2025
https://doi.org/10.3390/en18010176
Wang JWu JQu YXiao QGao QFang Y(2025)Multi-Target Device-Free Positioning Based on Spatial-Temporal mmWave Point CloudIEEE Transactions on Mobile Computing10.1109/TMC.2024.347467124:2(1163-1180)Online publication date: Feb-2025
https://doi.org/10.1109/TMC.2024.3474671
Xu ZYe TChen LGao YChen Z(2025)Health-Radar: Noncontact Multitarget Heart Rate Variability Detection Using FMCW RadarIEEE Sensors Journal10.1109/JSEN.2024.349475525:1(405-418)Online publication date: 1-Jan-2025
https://doi.org/10.1109/JSEN.2024.3494755
Qin YChen JJin LYao RGong Z(2025)Task offloading optimization in mobile edge computing based on a deep reinforcement learning algorithm using density clustering and ensemble learningScientific Reports10.1038/s41598-024-84038-315:1Online publication date: 2-Jan-2025
https://doi.org/10.1038/s41598-024-84038-3
Tang JYuan QZhang LMarshall BYen Tay AWang L(2025)Application of machine learning-assisted surface-enhanced Raman spectroscopy in medical laboratories: Principles, opportunities, and challengesTrAC Trends in Analytical Chemistry10.1016/j.trac.2025.118135184(118135)Online publication date: Mar-2025
https://doi.org/10.1016/j.trac.2025.118135
Seok UByun JSong J(2025)Disaster risk-informed optimization using buffered failure probability for regional-scale building retrofit strategyStructural Safety10.1016/j.strusafe.2024.102556114(102556)Online publication date: May-2025
https://doi.org/10.1016/j.strusafe.2024.102556
Wang WLi CZhang ZChen JHe SFeng Y(2025)Pseudo-label assisted contrastive learning model for unsupervised open-set domain adaptation in fault diagnosisReliability Engineering & System Safety10.1016/j.ress.2024.110650254(110650)Online publication date: Feb-2025
https://doi.org/10.1016/j.ress.2024.110650
Silinto BEdeme DCorigliano SDimovski AMerlo MZuidema CFaaij A(2025)An integrated geospatial modelling framework of hybrid microgrid sizing for rural electrification planningMethodsX10.1016/j.mex.2025.103153(103153)Online publication date: Jan-2025
https://doi.org/10.1016/j.mex.2025.103153
Golmohammadi AHernando DVan den bergh WHasheminejad N(2025)Advanced data-driven FBG sensor-based pavement monitoring system using multi-sensor data fusion and an unsupervised learning approachMeasurement10.1016/j.measurement.2024.115821242(115821)Online publication date: Jan-2025
https://doi.org/10.1016/j.measurement.2024.115821
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents