DOI:10.1145/2503210.2503262
research-article

Mr. Scan: extreme scale density-based clustering using a tree-based network of GPGPU nodes

Published: 17 November 2013

Abstract

Density-based clustering algorithms are a widely used class of data mining techniques that can find irregularly shaped clusters and cluster data without prior knowledge of the number of clusters the data contains. DBSCAN is the best-known density-based clustering algorithm. We introduce our version of DBSCAN, called Mr. Scan, which uses a hybrid parallel implementation that combines the MRNet tree-based distribution network with GPGPU-equipped nodes. Mr. Scan avoids the problems of existing implementations by effectively partitioning the point space and by optimizing DBSCAN's computation over dense data regions. We tested Mr. Scan on both a geolocated Twitter dataset and image data obtained from the Sloan Digital Sky Survey. At its largest scale, Mr. Scan clustered 6.5 billion points from the Twitter dataset on 8,192 GPU nodes on Cray Titan in 17.3 minutes. All other parallel DBSCAN implementations have demonstrated the ability to cluster only up to 100 million points.
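
To make the density-based idea concrete, the following is a minimal sequential sketch of textbook DBSCAN in Python. It is not the Mr. Scan implementation: it includes none of the MRNet partitioning, GPGPU computation, or dense-region optimizations mentioned above, and it uses a brute-force neighbor search. The names `region_query`, `dbscan`, `eps`, and `min_pts`, as well as the sample points, are illustrative assumptions rather than identifiers or data from the paper.

```python
from collections import deque
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; may later be claimed as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point: grow a new cluster from it.
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
        cluster += 1
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is never passed in: it falls out of the density parameters `eps` and `min_pts`, which is the property the abstract highlights. The brute-force `region_query` makes this sketch quadratic in the number of points, so it is only suitable for small inputs.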

Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN:9781450323789
DOI:10.1145/2503210
  • General Chair: William Gropp
  • Program Chair: Satoshi Matsuoka

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC13

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%