DOI: 10.5555/1283383.1283494
Article

k-means++: the advantages of careful seeding

Published: 07 January 2007

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
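
The abstract does not spell out the seeding rule. The following is a minimal sketch, assuming the D²-weighted sampling usually associated with k-means++: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The names `sq_dist`, `kmeanspp_seed`, and `potential` are illustrative and are not taken from the authors' test code.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers from `points` by D^2-weighted sampling (assumed rule)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = rng or random.Random()
    # First center: uniform at random over the input points.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Draw the next center with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Refresh nearest-center distances with the newly added center.
        d2 = [min(old, sq_dist(p, points[idx])) for old, p in zip(d2, points)]
    return centers


def potential(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

The returned centers would then be handed to the usual k-means iterations as the starting configuration. Dividing `potential` by the number of points gives the average squared distance mentioned in the abstract.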

Published In

SODA '07: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
January 2007
1322 pages
ISBN:9780898716245
  • Conference Chair: Harold Gabow

Publisher

Society for Industrial and Applied Mathematics

United States

Qualifiers

  • Article

Acceptance Rates

SODA '07 Paper Acceptance Rate 139 of 382 submissions, 36%;
Overall Acceptance Rate 411 of 1,322 submissions, 31%
