DOI:10.1145/2396761.2396844

Improving document clustering using automated machine translation

Published: 29 October 2012

Abstract

With the development of statistical machine translation, we have ready-to-use tools that can translate documents from one language into many other languages. These translations provide different yet correlated views of the same set of documents. This gives rise to an intriguing question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering has provided positive answers to this question. In this work, we propose an alternative approach that addresses the problem within the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated from machine translation are dense yet noisy. We show how to incorporate constraints of this type by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement, efficient, and consistently improve the clustering of real data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to existing multiview clustering algorithms, our technique requires neither the compatibility assumption nor the conditional independence assumption, and it does not involve subtle parameter tuning.
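The abstract does not spell out how translations become constraints, so the following is only a rough sketch of one way a translated view could yield dense, noisy, real-valued constraints. It is not the paper's parametric or non-parametric algorithm. The function names `cosine_matrix` and `soft_constraints`, the mean-centering rule, and the toy term-count vectors are all assumptions made for illustration.

```python
import math


def cosine_matrix(vectors):
    """Pairwise cosine similarity between document vectors (lists of numbers)."""
    norms = [math.sqrt(sum(x * x for x in v)) or 1.0 for v in vectors]
    n = len(vectors)
    return [
        [
            sum(a * b for a, b in zip(vectors[i], vectors[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def soft_constraints(translated_vectors):
    """Build a dense, signed constraint matrix from a translated view.

    Assumption (not taken from the paper): a pair whose similarity is above
    the mean off-diagonal similarity is treated as a soft Must-Link
    (positive entry), and a pair below the mean as a soft Cannot-Link
    (negative entry). The magnitude acts as a confidence weight.
    Requires at least two documents.
    """
    sim = cosine_matrix(translated_vectors)
    n = len(sim)
    off_diag = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [
        [0.0 if i == j else sim[i][j] - mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical term counts for four documents in a translated view:
# documents 0 and 1 share vocabulary, as do documents 2 and 3.
translated = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 3], [0, 1, 2, 2]]
Q = soft_constraints(translated)
```

Every document pair receives a constraint here, which is what makes the set dense, and each entry inherits whatever errors the translation introduced, which is what makes it noisy. A constrained clustering method that accepts real-valued constraints could take a matrix like `Q` alongside the affinity matrix computed from the original-language view.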

Published In

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN:9781450311564
DOI:10.1145/2396761
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. constrained spectral clustering
  2. document clustering
  3. machine translation

Qualifiers

  • Research-article

Conference

CIKM'12

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
