DOI: 10.1145/2623330.2623615

Scaling out big data missing value imputations: Pythia vs. Godzilla

Published: 24 August 2014

Abstract

Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user communities continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine ('Godzilla') that can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla with a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses a much smaller partition of the original dataset? If so, it would be preferable for performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which subset of cohorts to engage per imputation? But efficiency and scalability are just one key concern. Is it possible to do the above while ensuring imputation estimation errors comparable to, or even better than, Godzilla's? In this paper we derive answers to these fundamental questions and develop principled methods and a framework that offer large performance speed-ups and errors better than, or comparable to, those of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia's functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcases our efficiency, scalability, and accuracy claims.
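
To make the two pillars concrete, here is a minimal sketch of signature-based cohort selection. It is not the paper's algorithm: the abstract does not say what a signature contains or how similarity is measured, so the per-attribute mean used as a signature, the Euclidean similarity, the nearest-neighbour averaging used for imputation, and all function names (`signature`, `select_cohorts`, `knn_impute`) are assumptions chosen only for illustration.

```python
from math import sqrt

def signature(partition):
    """Hypothetical signature: the per-attribute mean (centroid) of a partition."""
    dims = len(partition[0])
    return [sum(row[d] for row in partition) / len(partition) for d in range(dims)]

def distance(a, b, dims):
    """Euclidean distance restricted to the attribute indices in `dims`."""
    return sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

def select_cohorts(signatures, record, top):
    """Rank cohorts by signature distance over the record's observed attributes."""
    observed = [d for d, v in enumerate(record) if v is not None]
    ranked = sorted(range(len(signatures)),
                    key=lambda c: distance(signatures[c], record, observed))
    return ranked[:top]

def knn_impute(rows, record, k):
    """Stand-in imputation: average each missing attribute over the k nearest rows."""
    observed = [d for d, v in enumerate(record) if v is not None]
    missing = [d for d, v in enumerate(record) if v is None]
    nearest = sorted(rows, key=lambda r: distance(r, record, observed))[:k]
    filled = list(record)
    for d in missing:
        filled[d] = sum(r[d] for r in nearest) / len(nearest)
    return filled

# Three toy partitions, one per cohort (made-up data).
partitions = [
    [[1.0, 1.0, 10.0], [1.2, 0.8, 11.0], [0.9, 1.1, 9.0]],
    [[5.0, 5.0, 50.0], [5.2, 4.8, 52.0], [4.9, 5.1, 48.0]],
    [[9.0, 9.0, 90.0], [9.1, 8.9, 91.0], [8.8, 9.2, 89.0]],
]
signatures = [signature(p) for p in partitions]   # computed once per cohort

record = [5.1, 4.9, None]                         # third attribute is missing
chosen = select_cohorts(signatures, record, top=1)
rows = [r for c in chosen for r in partitions[c]] # only the engaged cohorts' data
imputed = knn_impute(rows, record, k=2)
print(chosen, imputed)
```

In this toy setup only the cohort whose signature is closest to the record is engaged, so the imputation step scans three rows instead of nine. Any imputation routine could take the place of `knn_impute`, in line with the abstract's claim that the approach is independent of the imputation algorithm.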

Supplementary Material

MP4 File (p651-sidebyside.mp4)


      Published In

      KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2014
      2028 pages
      ISBN:9781450329569
      DOI:10.1145/2623330

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. big data
      2. clustering
      3. missing value

      Qualifiers

      • Research-article

      Funding Sources

      • European Social Fund
      • Greek National Funds National Strategic Reference Framework (NSRF) Research Funding Program: Thales

      Conference

      KDD '14

      Acceptance Rates

      KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
