research-article

Document clustering with universum

Authors:

Luo Si

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Pages 873 - 882

https://doi.org/10.1145/2009916.2010033

Published: 24 July 2011

Abstract

Document clustering is a popular research topic. It aims to partition documents into groups of similar objects (i.e., clusters) and has been widely used in applications such as automatic topic extraction, document organization, and filtering. Universum, a recently proposed concept, is a collection of "non-examples" that do not belong to any concept or cluster of interest. This paper proposes a novel document clustering technique, Document Clustering with Universum, which uses Universum examples to improve clustering performance. The intuition is that Universum examples can serve as supervised information, since they are known not to belong to any meaningful concept or cluster in the target domain. In particular, a maximum margin clustering method is proposed that models both target examples and Universum examples. An extensive set of experiments is conducted to demonstrate the effectiveness and efficiency of the proposed algorithm.
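The abstract does not give the objective function, so the following is only a rough sketch of the intuition for the two-cluster case, not the paper's formulation. It assumes a linear scoring function, the usual maximum margin clustering loss max(0, 1 - |f(x)|) for unlabeled target documents, and an epsilon-insensitive loss max(0, |f(u)| - eps) for Universum documents, a form commonly used in the Universum SVM literature. The function name and the parameters `C`, `C_u`, and `eps` are hypothetical.

```python
def mmc_universum_objective(w, b, targets, universum, C=1.0, C_u=1.0, eps=0.1):
    """Illustrative two-cluster objective (an assumption, not the paper's method).

    w, b      : parameters of a linear scoring function f(x) = w.x + b
    targets   : unlabeled target documents as feature vectors
    universum : Universum documents as feature vectors
    """
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    # Standard L2 regularizer that encourages a large margin.
    reg = 0.5 * sum(wi * wi for wi in w)

    # Target documents should fall clearly on one side: penalize |f(x)| < 1.
    target_loss = sum(max(0.0, 1.0 - abs(score(x))) for x in targets)

    # Universum documents belong to neither cluster: penalize |f(u)| > eps,
    # which pulls them toward the decision boundary.
    universum_loss = sum(max(0.0, abs(score(u)) - eps) for u in universum)

    return reg + C * target_loss + C_u * universum_loss


# Toy data: two target documents separated along the first feature,
# one Universum document that varies only along the second feature.
targets = [(2.0, 0.0), (-2.0, 0.0)]
universum = [(0.0, 1.0)]

good = mmc_universum_objective((1.0, 0.0), 0.0, targets, universum)
bad = mmc_universum_objective((0.0, 1.0), 0.0, targets, universum)
print(good < bad)
```

In this toy setting, the separator that splits the target documents while keeping the Universum document on the boundary scores lower than one that separates along the Universum direction, which is the behavior the abstract's intuition describes.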

Index Terms

Document clustering with universum
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

July 2011

1374 pages

ISBN:9781450307574

DOI:10.1145/2009916

General Chairs: Wei-Ying Ma (Microsoft Research Asia, China), Jian-Yun Nie (University of Montreal, Canada)

Program Chairs: Ricardo Baeza-Yates (Yahoo! Research, Spain), Tat-Seng Chua (National University of Singapore), W. Bruce Croft (University of Massachusetts, Amherst, USA)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '11

Sponsor:

SIGIR

SIGIR '11: The 34th International ACM SIGIR conference on research and development in Information Retrieval

July 24 - 28, 2011

Beijing, China

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
