
An Optimization Framework for Combining Ensembles of Classifiers and Clusterers with Applications to Nontransductive Semisupervised Learning and Transfer Learning

Published: 25 August 2014

Abstract

Unsupervised models can provide supplementary soft constraints to help classify new “target” data because similar instances in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place, as in transfer learning settings. This article describes a general optimization framework that takes as input class membership estimates from existing classifiers learned on previously encountered “source” (or training) data, as well as a similarity matrix from a cluster ensemble operating solely on the target (or test) data to be classified, and yields a consensus labeling of the target data. More precisely, the application settings considered are nontransductive semisupervised and transfer learning scenarios where the training data are used only to build an ensemble of classifiers and are subsequently discarded before classifying the target data. The framework admits a wide range of loss functions and classification/clustering methods. It exploits properties of Bregman divergences in conjunction with Legendre duality to yield a principled and scalable approach. A variety of experiments show that the proposed framework can yield results substantially superior to those provided by naïvely applying classifiers learned on the original task to the target data. In addition, we show that the proposed approach, although it is not conceptually transductive, can provide better results than some popular transductive learning techniques.
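To make the combination scheme concrete, below is a minimal sketch of its squared-loss special case; the framework itself covers general Bregman divergences, which this sketch does not attempt. It assumes the classifier ensemble's outputs have already been averaged into a per-point class-probability matrix pi, that the cluster ensemble has been summarized as a symmetric co-association matrix S over the target points, and that the function name, alpha, and the fixed-point update are illustrative rather than the authors' exact algorithm.

    import numpy as np

    def consensus_labels(pi, S, alpha=1.0, n_iter=200, tol=1e-6):
        """Sketch: fuse classifier-ensemble posteriors with cluster-ensemble
        similarity by minimizing an assumed squared-loss objective:

            sum_i ||pi_i - y_i||^2 + alpha * sum_{i<j} S_ij ||y_i - y_j||^2

        pi    : (n, k) averaged class-probability estimates for the n
                target points (each row sums to 1).
        S     : (n, n) symmetric co-association matrix from the cluster
                ensemble (fraction of base clusterings grouping i with j),
                zero diagonal.
        alpha : trade-off between classifier evidence and the cluster
                ensemble's smoothness constraint.
        """
        y = pi.copy()
        for _ in range(n_iter):
            # Fixed-point update obtained by setting the gradient to zero:
            #   y_i <- (pi_i + alpha * sum_j S_ij y_j) / (1 + alpha * sum_j S_ij)
            y_new = (pi + alpha * (S @ y)) / (1.0 + alpha * S.sum(axis=1, keepdims=True))
            if np.max(np.abs(y_new - y)) < tol:
                y = y_new
                break
            y = y_new
        return y.argmax(axis=1), y  # consensus hard labels, soft posteriors

    # Toy usage: three target points, two classes; the cluster ensemble
    # always groups points 0 and 1, pulling their posteriors together.
    pi = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
    S = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    labels, posteriors = consensus_labels(pi, S, alpha=0.5)

Because each update is a convex combination of probability vectors, the rows of y remain valid class distributions; alpha = 0 recovers the raw classifier-ensemble prediction, while larger alpha defers to the target-side cluster structure.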



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 9, Issue 1
October 2014
209 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2663598

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 August 2014
Accepted: 01 December 2013
Revised: 01 June 2013
Received: 01 April 2012
Published in TKDD Volume 9, Issue 1


Author Tags

  1. classification
  2. clustering
  3. ensembles
  4. semisupervised learning
  5. transductive learning
  6. transfer learning

Qualifiers

  • Research-article
  • Research
  • Refereed
