Article

A probabilistic framework for semi-supervised clustering

Authors:

Mikhail Bilenko,

Raymond J. Mooney

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 59 - 68

https://doi.org/10.1145/1014052.1014062

Published: 22 August 2004

Abstract

Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints either to modify the objective function or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
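
To make the idea of an objective that combines cluster distortion with pairwise supervision more concrete, here is a minimal sketch in Python. It is not the algorithm from the paper: the abstract does not spell out the objective, so the form used here (squared Euclidean distortion plus a constant penalty `w` for each violated must-link or cannot-link pair) is an assumption, as are all function and parameter names. The paper's framework additionally covers other distortion measures and distance learning, which this sketch leaves out.

```python
import numpy as np

def sq_euclidean(x, mu):
    """Squared Euclidean distance, one example of a Bregman divergence."""
    d = x - mu
    return float(d @ d)

def constrained_objective(X, labels, centroids, must_link, cannot_link, w=1.0):
    """Total distortion plus a fixed penalty w per violated constraint (assumed form)."""
    distortion = sum(sq_euclidean(X[i], centroids[labels[i]]) for i in range(len(X)))
    ml_violations = sum(int(labels[i] != labels[j]) for i, j in must_link)
    cl_violations = sum(int(labels[i] == labels[j]) for i, j in cannot_link)
    return distortion + w * (ml_violations + cl_violations)

def assign_points(X, labels, centroids, must_link, cannot_link, w=1.0):
    """One greedy pass over the points: each point takes the cluster that
    minimizes its own distortion plus penalties, given its partners' current labels."""
    labels = labels.copy()
    for i in range(len(X)):
        costs = []
        for h in range(len(centroids)):
            cost = sq_euclidean(X[i], centroids[h])
            for a, b in must_link:
                if i in (a, b):
                    other = b if a == i else a
                    cost += w * int(labels[other] != h)
            for a, b in cannot_link:
                if i in (a, b):
                    other = b if a == i else a
                    cost += w * int(labels[other] == h)
            costs.append(cost)
        labels[i] = int(np.argmin(costs))
    return labels

def update_centroids(X, labels, centroids):
    """Recompute each centroid as the mean of its members; empty clusters keep their centroid."""
    new = centroids.copy()
    for h in range(len(centroids)):
        members = X[labels == h]
        if len(members) > 0:
            new[h] = members.mean(axis=0)
    return new

# Toy usage: four points on a line, with a must-link between points 1 and 2.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.2, 0.0]])
centroids = np.array([[0.1, 0.0], [1.1, 0.0]])
labels = np.zeros(len(X), dtype=int)
labels = assign_points(X, labels, centroids, must_link=[(1, 2)], cannot_link=[], w=10.0)
centroids = update_centroids(X, labels, centroids)
```

Alternating `assign_points` and `update_centroids` until the labels stop changing gives a k-means-style loop in which a large `w` lets the must-link pair override plain nearest-centroid assignment.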

Index Terms

A probabilistic framework for semi-supervised clustering
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2004

874 pages

ISBN:1581138881

DOI:10.1145/1014052

General Chairs: Won Kim (Cyber Database Solutions), Ronny Kohavi (Amazon.com)

Program Chairs: Johannes Gehrke (Cornell University), William DuMouchel (AT&T Labs Research)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD04: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 22 - 25, 2004

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
