skip to main content
10.1145/2009916.2010033acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article

Document clustering with universum

Published: 24 July 2011 Publication History

Abstract

Document clustering is a popular research topic, which aims to partition documents into groups of similar objects (i.e., clusters), and has been widely used in many applications such as automatic topic extraction, document organization and filtering. As a recently proposed concept, Universum is a collection of "non-examples" that do not belong to any concept/cluster of interest. This paper proposes a novel document clustering technique -- Document Clustering with Universum, which utilizes the Universum examples to improve the clustering performance. The intuition is that the Universum examples can serve as supervised information and help improve the performance of clustering, since they are known not belonging to any meaningful concepts/clusters in the target domain. In particular, a maximum margin clustering method is proposed to model both target examples and Universum examples for clustering. An extensive set of experiments is conducted to demonstrate the effectiveness and efficiency of the proposed algorithm.

References

[1]
R. Bekkerman, H. Raghavan, J. Allan, and K. Eguchi. Interactive clustering of text collections according to a user-specified criterion. In IJCAI, pages 684--689, 2007.
[2]
R. Bekkerman, S. Zilberstein, and J. Allan. Web page clustering using heuristic search in the web graph. In IJCAI, pages 2280--2285. IJCAI, 2006.
[3]
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.
[4]
S. Chen and C. Zhang. Selecting informative universum sample for semi-supervised learning. In IJCAI, pages 1016--1021, 2009.
[5]
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2 edition, November 2001.
[6]
J. Gong and D. W. Oard. Selecting hierarchical clustering cut points for web person-name disambiguation. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR'09, pages 778--779, New York, NY, USA, 2009. ACM.
[7]
J. He, E. J. Meij, and M. de Rijke. Result diversification based on query-specific cluster ranking. Journal of the American Society of Information Science and Technology, 2011. To appear.
[8]
T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50--57, 1999.
[9]
K. Huang, Z. Xu, I. King, and M. R. Lyu. Semi-supervised learning from general unlabeled data. In ICDM, pages 273--282, 2008.
[10]
T. Joachims. Training linear svms in linear time. In KDD, pages 217--226, 2006.
[11]
J. Kelley Jr. The cutting-plane method for solving convex programs. Journal of the SIAM, pages 703--712, 1960.
[12]
D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361--397, 2004.
[13]
Q. Li, B. M. Kim, and S.-H. Myaeng. Clustering for probabilistic model estimation for cf. In WWW (Special interest tracks and posters), pages 1104--1105, 2005.
[14]
X. Liu, Y. Gong, W. Xu, and S. Zhu. Document clustering with cluster refinement and model selection capabilities. In SIGIR, pages 191--198, 2002.
[15]
C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 1 edition, July 2008.
[16]
C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization : Algorithms and Complexity. Dover Publications, July 1998.
[17]
W. M. Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846--850, 1971.
[18]
N. Sahoo, J. Callan, R. Krishnan, G. T. Duncan, and R. Padman. Incremental hierarchical clustering of text documents. In CIKM, pages 357--366, 2006.
[19]
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888--905, 2000.
[20]
F. H. Sinz, O. Chapelle, A. Agarwal, and B. Schölkopf. An analysis of inference with the universum. In NIPS, 2007.
[21]
M. Spitters and W. Kraaij. Unsupervised clustering in multilingual news streams. In LREC 2002 workshoop: Event Modelling for Multilingual Document Linking, pages 42--46, 2002.
[22]
A. Strehl and J. Ghosh. Cluster ensembles -- a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583--617, 2002.
[23]
T. Tao and C. Zhai. A mixture clustering model for pseudo feedback in information retrieval. In IFCS, 2004.
[24]
C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311--365, 2010.
[25]
H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In NIPS, pages 1417--1424, 2006.
[26]
A. S. Vishwanathan, A. J. Smola, and S. V. N. Vishwanathan. Kernel methods for missing variables. In AISTAT, pages 325--332, 2005.
[27]
F. Wang, C. Zhang, and T. Li. Regularized clustering for documents. In SIGIR, pages 95--102, 2007.
[28]
J. Weston, R. Collobert, F. H. Sinz, L. Bottou, and V. Vapnik. Inference with the universum. In ICML, pages 1009--1016, 2006.
[29]
E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 505--512, 2002.
[30]
L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In NIPS, 2004.
[31]
W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267--273, 2003.
[32]
H. Yang and J. P. Callan. Near-duplicate detection by instance-level constrained clustering. In SIGIR, pages 421--428. ACM, 2006.
[33]
S. X. Yu and J. Shi. Multiclass spectral clustering. In ICCV, pages 313--319, 2003.
[34]
L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.
[35]
D. Zhang, J. Wang, F. Wang, and C. Zhang. Semi-supervised classification with universum. In SDM, pages 323--333, 2008.
[36]
B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering. In ICML, pages 1248--1255, 2008.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
July 2011
1374 pages
ISBN:9781450307574
DOI:10.1145/2009916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 July 2011

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. clustering
  2. constrained concave-convex procedure (cccp)
  3. maximum margin clustering
  4. universum

Qualifiers

  • Research-article

Conference

SIGIR '11
Sponsor:

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)0
Reflects downloads up to 23 Jan 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media