
Scalable and Accurate Online Feature Selection for Big Data

Published: 03 December 2016

Abstract

Feature selection is important in many big data applications, and two critical challenges are closely associated with big data. First, in many big data applications the dimensionality is extremely high, often in the millions, and keeps growing. Second, big data applications call for highly scalable feature selection algorithms that work in an online manner, such that each feature can be processed in a sequential scan. In this paper we present SAOLA, a Scalable and Accurate OnLine Approach for feature selection. Based on a theoretical analysis of bounds on the pairwise correlations between features, SAOLA employs novel pairwise comparison techniques and maintains a parsimonious model over time in an online manner. Furthermore, to deal with features that arrive in groups, we extend SAOLA and propose a new group-SAOLA algorithm for online group feature selection. Group-SAOLA maintains, online, a set of feature groups that is sparse at the levels of both groups and individual features simultaneously. An empirical study on a series of real benchmark datasets shows that our two algorithms, SAOLA and group-SAOLA, are scalable on datasets of extremely high dimensionality and outperform state-of-the-art feature selection methods.
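
As a reading aid, here is a minimal sketch of the online pairwise-comparison idea the abstract describes. It assumes mutual information as the correlation measure, discrete-valued features, and a relevance threshold delta; the function and parameter names are illustrative, and the comparison rules are a plausible rendering rather than the authors' exact SAOLA procedure.

```python
from sklearn.metrics import mutual_info_score  # assumed correlation measure

def online_pairwise_selection(feature_stream, labels, delta=0.01):
    """Illustrative online feature selection via pairwise comparisons.

    `feature_stream` yields (name, column) pairs one feature at a time,
    with discrete-valued columns; `delta` is an assumed relevance
    threshold. A sketch only, not the exact SAOLA rules.
    """
    selected = {}  # name -> (column, relevance to the class labels)
    for name, col in feature_stream:
        rel = mutual_info_score(labels, col)
        if rel <= delta:
            continue  # weakly relevant to the class: discard immediately
        keep_new = True
        for other, (ocol, orel) in list(selected.items()):
            pair = mutual_info_score(col, ocol)
            if orel >= rel and pair >= rel:
                keep_new = False  # a more relevant existing feature subsumes the new one
                break
            if rel >= orel and pair >= orel:
                del selected[other]  # the new feature subsumes an existing one: evict it
        if keep_new:
            selected[name] = (col, rel)
    return selected
```

Each arriving feature triggers only pairwise tests against the currently selected set, which is what keeps the model parsimonious and lets the stream be processed in a single sequential scan; group-SAOLA, as described above, lifts the same comparisons to whole groups of features so that sparsity holds at both the group and the individual-feature level.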



    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 11, Issue 2
    May 2017, 419 pages
    ISSN: 1556-4681
    EISSN: 1556-472X
    DOI: 10.1145/3017677
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 December 2016
    Accepted: 01 July 2016
    Revised: 01 March 2016
    Received: 01 December 2014
    Published in TKDD Volume 11, Issue 2


    Author Tags

    1. Online feature selection
    2. big data
    3. extremely high dimensionality
    4. group features

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National 973 Program of China
    • Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China
    • PIMS Post-Doctoral Fellowship Award of the Pacific Institute for the Mathematical Sciences
    • National Natural Science Foundation of China
    • NSERC Discovery grant
    • State Key Program of National Natural Science of China
    • BCIC NRAS Team Project
