Scaling out big data missing value imputations: Pythia vs. Godzilla

Authors:

Christos Anagnostopoulos,

Peter Triantafillou

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 651 - 660

https://doi.org/10.1145/2623330.2623615

Published: 24 August 2014

Abstract

Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user communities continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine ("Godzilla") which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla with a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses a much smaller partition of the original dataset? If so, it would be preferable, for obvious performance reasons, to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which subset of cohorts to engage per imputation? But efficiency and scalability are just one key concern. Is it possible to do the above while ensuring imputation estimation errors comparable to, or even better than, Godzilla's? In this paper we derive answers to these fundamental questions and develop principled methods and a framework which offer large performance speed-ups and errors better than, or comparable to, those of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia's functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcases our efficiency, scalability, and accuracy claims.
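
The two pillars named in the abstract (one signature per cohort, plus a similarity test that picks which cohorts to engage) can be illustrated with a toy sketch. The abstract does not say how signatures or similarity are defined, so everything concrete here is an assumption made for illustration: the signature is taken to be the mean vector of a partition, similarity is Euclidean distance, and the imputation step is a plain nearest-neighbour average standing in for whatever MV imputation algorithm is plugged in. The function names `signature`, `select_cohorts`, and `impute` are hypothetical and do not come from the paper.

```python
import math
from statistics import fmean


def signature(partition, observed):
    """Hypothetical signature: mean of each observed attribute in a partition."""
    return [fmean(row[j] for row in partition) for j in observed]


def select_cohorts(signatures, query, top_k):
    """Rank cohorts by Euclidean distance from the query to their signature
    (an assumed similarity notion) and keep the top_k closest."""
    ranked = sorted(signatures, key=lambda cid: math.dist(signatures[cid], query))
    return ranked[:top_k]


def impute(partitions, cohort_ids, query, observed, target, n_neighbors):
    """Stand-in imputation: average the target attribute over the rows
    nearest to the query, looking only inside the engaged cohorts."""
    rows = [row for cid in cohort_ids for row in partitions[cid]]
    rows.sort(key=lambda row: math.dist([row[j] for j in observed], query))
    return fmean(row[target] for row in rows[:n_neighbors])


# Toy data: three cohorts; each row is (x0, x1, y) and y is missing in the query.
partitions = {
    "c0": [(0.0, 0.1, 1.0), (0.2, 0.0, 1.2), (0.1, 0.3, 0.8)],
    "c1": [(5.0, 5.1, 10.0), (5.2, 4.9, 10.4), (4.8, 5.0, 9.6)],
    "c2": [(9.0, 9.2, 20.0), (9.1, 8.9, 19.0), (8.9, 9.0, 21.0)],
}
observed, target = [0, 1], 2

# Signatures are computed once per cohort, ahead of any imputation request.
signatures = {cid: signature(rows, observed) for cid, rows in partitions.items()}

query = [5.1, 5.0]
engaged = select_cohorts(signatures, query, top_k=1)
estimate = impute(partitions, engaged, query, observed, target, n_neighbors=2)
print(engaged, estimate)
```

In this sketch only the engaged cohort's rows are scanned per request, which is the source of the speed-up the abstract argues for; whether accuracy matches a single-machine scan depends on how well the signature summarises each partition.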

Supplementary Material

MP4 File (p651-sidebyside.mp4), 207.98 MB
