DOI: 10.5555/2976456.2976532
Article

Correcting sample selection bias by unlabeled data

Published: 04 December 2006

Abstract

We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
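
The abstract does not spell out the optimization, so the following is only a minimal sketch of what "matching distributions in feature space" could look like. It assumes a Gaussian kernel and picks nonnegative weights for the training points so that their weighted mean in kernel feature space approaches the mean of the unlabeled test points. The function names, the box bound, the kernel width, and the use of projected gradient descent are all assumptions made for illustration, not details taken from the paper.

```python
import math


def rbf(x, z, sigma):
    """Gaussian kernel between two feature vectors (sequences of floats)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))


def matching_weights(train, test, sigma=1.0, bound=10.0, steps=2000):
    """Return one weight per training point (hypothetical helper).

    Minimizes 0.5 * b'Kb - kappa'b over the box 0 <= b_i <= bound, which is
    the squared feature-space distance between the weighted training mean
    and the test mean, up to a constant factor and an additive constant.
    The solver is plain projected gradient descent, chosen for brevity.
    """
    n, m = len(train), len(test)
    # Kernel matrix among training points.
    K = [[rbf(xi, xj, sigma) for xj in train] for xi in train]
    # kappa_i = (n / m) * sum_j k(x_i, z_j) couples each training point to the test set.
    kappa = [(n / m) * sum(rbf(xi, z, sigma) for z in test) for xi in train]
    # The largest row sum bounds the largest eigenvalue of K (entries are nonnegative),
    # so 1 / max_row_sum is a safe step size for this convex problem.
    lr = 1.0 / max(sum(row) for row in K)
    beta = [1.0] * n
    for _ in range(steps):
        grad = [sum(K[i][j] * beta[j] for j in range(n)) - kappa[i] for i in range(n)]
        # Gradient step followed by projection onto the box [0, bound].
        beta = [min(bound, max(0.0, b - lr * g)) for b, g in zip(beta, grad)]
    return beta


if __name__ == "__main__":
    # Training points spread over [0, 3]; unlabeled test points concentrated near 2.5.
    train = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
    test = [[2.0], [2.5], [3.0], [2.2], [2.8]]
    for x, b in zip(train, matching_weights(train, test, sigma=0.5)):
        print(x, round(b, 3))
```

In this toy setup the training points that lie far from the test sample should receive weights near zero, while those inside the test region are weighted up. The resulting weights could then be passed to any learner that accepts per-example weights. No density is estimated for either sample at any point.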

Published In

Guide Proceedings
NIPS'06: Proceedings of the 20th International Conference on Neural Information Processing Systems
December 2006
1632 pages

Publisher

MIT Press

Cambridge, MA, United States
