DOI: 10.1145/3152494.3152496
research-article

A novel topic modeling based weighting framework for class imbalance learning

Published: 11 January 2018

Abstract

Classifying imbalanced data has become an important research problem because data from most real-world applications follow non-uniform class distributions. A simple way to handle class imbalance is to sample from the dataset so as to compensate for the imbalance in class proportions. When the data distribution is unknown at sampling time, making assumptions about it requires domain knowledge and insight into the dataset. We propose a novel unsupervised, topic-modeling-based weighting framework to estimate the latent data distribution of a dataset. We also propose TODUS, a topics-oriented directed undersampling algorithm that follows the estimated data distribution to draw samples from the dataset. TODUS minimizes the loss of important information that is typically dropped during random undersampling. We show empirically that TODUS performs better than the other sampling methods compared in our experiments.
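The abstract does not spell out how TODUS draws its samples, so the following is only a minimal sketch of the general idea of distribution-following undersampling, not the published algorithm. It assumes that some topic model has already produced a vector of topic proportions for each majority-class instance; the function name `directed_undersample`, the `topic_weights` input, and the choice to stratify by dominant topic are all hypothetical and introduced here for illustration.

```python
import random
from collections import defaultdict


def directed_undersample(topic_weights, n_keep, seed=0):
    """Return sorted indices of n_keep instances whose dominant-topic mix
    mirrors that of the full set.

    topic_weights[i] holds the topic proportions of instance i. This input
    is an assumption of the sketch: any topic model could supply it.
    """
    total = len(topic_weights)
    if not 0 < n_keep <= total:
        raise ValueError("n_keep must be between 1 and the number of instances")
    rng = random.Random(seed)

    # Group instance indices by their dominant topic (hard assignment).
    groups = defaultdict(list)
    for i, weights in enumerate(topic_weights):
        dominant = max(range(len(weights)), key=weights.__getitem__)
        groups[dominant].append(i)

    # Integer quota per topic, proportional to the size of its group.
    quota = {t: n_keep * len(idx) // total for t, idx in groups.items()}
    remainder = {t: n_keep * len(idx) % total for t, idx in groups.items()}

    # Largest-remainder rounding so the quotas sum exactly to n_keep.
    leftover = n_keep - sum(quota.values())
    for t in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[t] += 1

    # Draw each topic's quota uniformly at random, without replacement.
    kept = []
    for t, idx in groups.items():
        kept.extend(rng.sample(idx, quota[t]))
    return sorted(kept)


# Toy majority class: 8 instances dominated by topic 0, 4 by topic 1.
toy_weights = [[0.9, 0.1]] * 8 + [[0.2, 0.8]] * 4
kept_indices = directed_undersample(toy_weights, n_keep=6)
```

Unlike plain random undersampling, which can by chance drop nearly all instances of a small topic, this stratified draw keeps each topic's share of the reduced set close to its share of the original majority class.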

Published In

CODS-COMAD '18: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data
January 2018
379 pages
ISBN:9781450363419
DOI:10.1145/3152494
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. class imbalance learning
  2. data distribution estimation
  3. directed undersampling
  4. topic modeling

Qualifiers

  • Research-article

Conference

CoDS-COMAD '18

Acceptance Rates

CODS-COMAD '18 Paper Acceptance Rate 50 of 150 submissions, 33%;
Overall Acceptance Rate 197 of 680 submissions, 29%
