research-article

Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality

Authors:

Christos Anagnostopoulos,

Peter TriantafillouAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 11, Issue 4

Article No.: 47, Pages 1 - 46

https://doi.org/10.1145/3059177

Published: 29 June 2017 Publication History

Abstract

Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and in query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., for privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodating the well-known selection queries: multi-dimensional range and distance-nearest neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space, by learning the analysts’ access patterns over a data space, (ii) associates query vectors with their corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers a performance that is superior to that of data-driven approaches.

References

[1]

Milton Abramowitz. 1974. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Dover Publications, Incorporated.

Digital Library

[2]

Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinlander, Matthias J. Sax, Sebastian Schelter, Mareike Hoger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. The VLDB Journal 23, 6 (December 2014), 939--964.

Digital Library

[3]

Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, and Till Westmann. 2014. AsterixDB: A scalable, open source BDMS. Proceedings of the VLDB Endowment 7, 14 (October 2014), 1905--1916.

Digital Library

[4]

Christos Anagnostopoulos and Peter Triantafillou. 2015a. Learning set cardinality in distance nearest neighbours. In Proceedings of the IEEE International Conference on Data Mining (ICDM’15). 691--696.

Digital Library

[5]

Christos Anagnostopoulos and Peter Triantafillou. 2015b. Learning to accurately COUNT with query-driven predictive analytics. In Proceedings of the IEEE International Conference on Big Data (Big Data’15). 14--23.

Digital Library

[6]

Atanas Atanasov, Madhusudhanan Srinivasan, and Tobias Weinzierl. 2012. Query-driven parallel exploration of large datasets. In Proceedings of the IEEE Symposium on Large Data Analysis and Visualization, (LDAV’12). 23--30.

[7]

Natasha Balac, Tamara B. Sipes, Nicole Wolter, Kenneth Nunes, Robert S. Sinkovits, and Homa Karimabadi. 2013. Large scale predictive analytics for real-time energy management. In Proceedings of the 2013 IEEE International Conference on Big Data. 657--664.

[8]

Stefan Berchtold, Christian Bohm, Daniel A. Keim, and Hans-Peter Kriegel. 1997. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’97). ACM, New York, NY, 78--86.

Digital Library

[9]

Dimitri P. Bertsekas. 2005. Dynamic Programming and Optimal Control. Volume I. Athena Scientific, Belmont, MA.

Digital Library

[10]

Leon Bottou and Yoshua Bengio. 1994. Convergence properties of the K-means algorithms. In Advances in Neural Information Processing Systems 7, NIPS Conference, Denver, Colorado, USA. 585--592.

Digital Library

[11]

Olivier Bousquet and Leon Bottou. 2008. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.). Curran Associates, Inc., 161--168.

Digital Library

[12]

Gail A. Carpenter and Stephen Grossberg. 1988. The ART of adaptive pattern recognition by a self-organizing neural network. IEEE Computer 21, 3 (1988), 77--88.

Digital Library

[13]

Kaushik Chakrabarti, Minos N. Garofalakis, Rajeev Rastogi, and Kyuseok Shim. 2000. Approximate query processing using wavelets. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB’00). Morgan Kaufmann Publishers, Inc., San Francisco, CA, 111--122.

Digital Library

[14]

Abon Chaudhuri, Tzu-Hsuan Wei, Teng-Yok Lee, Han-Wei Shen, and Tom Peterka. 2014. Efficient range distribution query for visualizing scientific data. In Proceedings of the IEEE Pacific Visualization Symposium, (PacificVis’14). 201--208.

Digital Library

[15]

Yuan Shih Chow, Herbert Ellis Robbins, and David Siegmund. 1971. Great Expectations: The Theory of Optimal Stopping. Houghton Mifflin, Boston.

[16]

Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. 2012. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases 4, 1--3 (January 2012), 1--294.

Digital Library

[17]

D. A. Darling, T. Liggett, and H. M. Taylor. 1972. Optimal stopping for partial sums. The Annals of Mathematical Statistics 43, 4 (1972), 1363--1368.

[18]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107--113.

Digital Library

[19]

Luke J. Gosink, Christoph Garth, John C. Anderson, E. Wes Bethel, and Kenneth I. Joy. 2011. An application of multivariate statistical analysis for query-driven visualization. IEEE Transactions on Visualization and Computer Graphics 17, 3 (2011), 264--275.

Digital Library

[20]

Mihajlo Grbovic and Slobodan Vucetic. 2009. Regression learning vector quantization. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM’09). 788--793.

Digital Library

[21]

Dimitrios Gunopulos, George Kollios, Vassilis J. Tsotras, and Carlotta Domeniconi. 2005. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14, 2 (2005), 137--154.

Digital Library

[22]

Gus W. Haggstrom. 1967. Optimal sequential procedures when more than one stop is required. The Annals of Mathematical Statistics 38, 6 (1967), 1618--1626.

[23]

Elena Ikonomovska, Joao Gama, and Savso Dvzeroski. 2010. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery 23, 1 (2010), 128--168.

Digital Library

[24]

John E. Dennis Jr. and Robert B. Schnabel. 1983. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall.

Digital Library

[25]

Alan F. Karr. 2010. Secure statistical analysis of distributed databases, emphasizing what we don’t know. Journal of Privacy and Confidentiality 1, 2 (2010), 197--211.

[26]

Tuevo Kohonen. 1989. Self-Organization and Associative Memory: 3rd Edition. Springer-Verlag New York, Inc., New York, NY.

Digital Library

[27]

Teuvo Kohonen. 2013. Essentials of the self-organizing map. Neural Networks 37 (January 2013), 52--65.

Digital Library

[28]

Jong-Seok Lee, Cheol Hoon Park, and Touradj Ebrahimi. 2013. Theory and Applications of Hybrid Simulated Annealing. Springer Berlin Heidelberg, 395--422.

[29]

Lester E. Dubins and Henry Teicher. 1967. Optimal stopping when the future is discounted. The Annals of Mathematical Statistics 38, 2 (1967), 601--605.

[30]

Shoumei Li and Jinping Zhang. 2005. A general method for convergence theorems of fuzzy set-valued random variables and its applications to martingales and uniform amarts. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 13, 3 (2005), 243--254.

[31]

Moshe Lichman. 2013. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.

[32]

Chieh-Yen Lin, Cheng-Hao Tsai, Ching-Pei Lee, and Chih-Jen Lin. 2014. Large-scale logistic regression and linear support vector machines using spark. In Proceedings of the 2014 IEEE International Conference on Big Data, (Big Data’14), Washington, DC, USA, October 27--30. 519--528.

[33]

Junshui Ma, James Theiler, and Simon Perkins. 2003. Accurate on-line support vector regression. Neural Computing 15, 11 (November 2003), 2683--2703.

Digital Library

[34]

James B. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman (Eds.), Vol. 1. University of California Press, 281--297.

[35]

Scott C. Newton, Surya Pemmaraju, and Sunanda Mitra. 1992. Adaptive fuzzy leader clustering of complex data sets in pattern recognition. IEEE Transactions on Neural Networks 3, 5 (1992), 794--800.

Digital Library

[36]

Frank Olken and Doron Rotem. 1990. Random sampling from database files: A survey. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management (SSDBM’90). Springer-Verlag, London, UK, 92--111.

Digital Library

[37]

Goran Peskir and Albert Shiryaev. 2006. Optimal Stopping and Free-Boundary Problems. Springer, Dordrecht.

[38]

Milos Radovanovic, Alexandros Nanopoulos, and Mirjana Ivanovic. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (December 2010), 2487--2531.

Digital Library

[39]

Timothy T. Rogers and James L. McClelland. 2014. Parallel distributed processing at 25: Further explorations in the microstructure of cognition. Cognitive Science 38 (2014), 1024--1077.

[40]

Kenneth Rose, Eitan Gurewitz, and Geoffrey C. Fox. 1992. Vector quantization by deterministic annealing. IEEE Transactions on Information Theory 38, 4 (1992), 1249--1257.

Digital Library

[41]

David A. Shnidman. 1989. The calculation of the probability of detection and the generalized Marcum Q-function. IEEE Transactions on Information Theory 35, 2 (1989), 389--400.

Digital Library

[42]

Utkarsh Srivastava, Peter J. Haas, Volker Markl, Marcel Kutsch, and Tam Minh Tran. 2006. ISOMER: Consistent histogram construction using query feedback. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06, Atlanta, GA, USA), April 3--8. 39.

Digital Library

[43]

Yin Sun, Árpád Baricz, and Shidong Zhou. 2010. On the monotonicity, log-concavity, and tight bounds of the generalized Marcum and Nuttall Q-functions. IEEE Transactions on Information Theory 56, 3 (2010), 1166--1186.

Digital Library

[44]

Yufei Tao, Christos Faloutsos, and Dimitris Papadias. 2003. The power-method: A comprehensive estimation technique for multi-dimensional queries. In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM’03). ACM, New York, NY, 83--90.

Digital Library

[45]

Hien To, Kuorong Chiang, and Cyrus Shahabi. 2013. Entropy-based histograms for selectivity estimation. In Proceedings of the 22nd ACM International Conference on Information 8 Knowledge Management (CIKM’13). ACM, New York, NY, 1939--1948.

Digital Library

[46]

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 2 (1971), 264--280.

[47]

Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC’13). ACM, New York, NY, Article 5, 16 pages.

Digital Library

[48]

Raajay Viswanathan, Prateek Jain, Srivatsan Laxman, and Arvind Arasu. 2011. A learning framework for self-tuning histograms. CoRR abs/1111.7295 (2011). http://arxiv.org/abs/1111.7295.

[49]

Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 1 (March 1985), 37--57.

Digital Library

[50]

Tom White. 2009. Hadoop: The Definitive Guide (1st ed.). O’Reilly Media, Inc.

Digital Library

[51]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12). USENIX Association, Berkeley, CA, 2.

Digital Library

Cited By

Koukosias AAnagnostopoulos CKolomvatsos K(2025)Task-Aware Data Selectivity in Pervasive Edge Computing EnvironmentsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.348553137:1(513-525)Online publication date: Jan-2025
https://doi.org/10.1109/TKDE.2024.3485531
Zakrzewicz M(2024)Learning From User-Specified Optimizer Hints in Database SystemsFoundations of Computing and Decision Sciences10.2478/fcds-2024-001149:2(181-197)Online publication date: 26-May-2024
https://doi.org/10.2478/fcds-2024-0011
Aladwani TAnagnostopoulos CKolomvatsos K(2024)Node and relevant data selection in distributed predictive analytics: A query-centric approachJournal of Network and Computer Applications10.1016/j.jnca.2024.104029232(104029)Online publication date: Dec-2024
https://doi.org/10.1016/j.jnca.2024.104029
Show More Cited By

Index Terms

Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality

Recommendations

Missing Data Problem in Predictive Analytics
ICSCA '19: Proceedings of the 2019 8th International Conference on Software and Computer Applications

A revolution in computational methods and statistics to process and analyse data into insight and knowledge is along with the growth of data. The paradigm of data analytic is changed from explicit to implicit raises the way to extract knowledge from ...
An efficient predictive analytics system for high dimensional big data
Abstract
The excessive growth of high dimensional big data has resulted in a greater challenge for data scientists to efficiently obtain valuable knowledge from these data. Traditional data mining techniques are not fit to process big data. ...
Survey of the Different Type of Data Analytics Algorithms

From as early as 1854 to today, society has been gathering, processing, transforming, modeling and visualizing data to help drive data-driven decisions. The qualitative definition of big data can be defined more conclusively as data that has high volume,...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data

ACM Transactions on Knowledge Discovery from Data Volume 11, Issue 4

Special Issue on KDD 2016 and Regular Papers

November 2017

419 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3119906

Editor:
Jie Tang
Tsinghua University, China

Issue’s Table of Contents

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2017

Accepted: 01 March 2017

Revised: 01 December 2016

Received: 01 May 2016

Published in TKDD Volume 11, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
332
Total Downloads

Downloads (Last 12 months)16
Downloads (Last 6 weeks)5

Reflects downloads up to 06 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Koukosias AAnagnostopoulos CKolomvatsos K(2025)Task-Aware Data Selectivity in Pervasive Edge Computing EnvironmentsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.348553137:1(513-525)Online publication date: Jan-2025
https://doi.org/10.1109/TKDE.2024.3485531
Zakrzewicz M(2024)Learning From User-Specified Optimizer Hints in Database SystemsFoundations of Computing and Decision Sciences10.2478/fcds-2024-001149:2(181-197)Online publication date: 26-May-2024
https://doi.org/10.2478/fcds-2024-0011
Aladwani TAnagnostopoulos CKolomvatsos K(2024)Node and relevant data selection in distributed predictive analytics: A query-centric approachJournal of Network and Computer Applications10.1016/j.jnca.2024.104029232(104029)Online publication date: Dec-2024
https://doi.org/10.1016/j.jnca.2024.104029
Kraska T(2021)Towards instance-optimized data systemsProceedings of the VLDB Endowment10.14778/3476311.347639214:12(3222-3232)Online publication date: 1-Jul-2021
https://dl.acm.org/doi/10.14778/3476311.3476392
Zaghloul MSalem MAli-Eldin A(2021)A new framework based on features modeling and ensemble learning to predict query performancePLOS ONE10.1371/journal.pone.025843916:10(e0258439)Online publication date: 18-Oct-2021
https://doi.org/10.1371/journal.pone.0258439
Marcus RNegi PMao HTatbul NAlizadeh MKraska TLi GLi ZIdreos SSrivastava D(2021)Bao: Making Learned Query Optimization PracticalProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452838(1275-1288)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3452838
Wang YXiao CQin JMao ROnizuka MWang WZhang RIshikawa YLi GLi ZIdreos SSrivastava D(2021)Consistent and Flexible Selectivity Estimation for High-Dimensional DataProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452772(2319-2327)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3452772
Savva FAnagnostopoulos CTriantafillou P(2020)SuRF: Identification of Interesting Data Regions with Surrogate Models2020 IEEE 36th International Conference on Data Engineering (ICDE)10.1109/ICDE48307.2020.00118(1321-1332)Online publication date: Apr-2020
https://doi.org/10.1109/ICDE48307.2020.00118
Anagnostopoulos C(2020)Edge-centric inferential modeling & analyticsJournal of Network and Computer Applications10.1016/j.jnca.2020.102696164(102696)Online publication date: Aug-2020
https://doi.org/10.1016/j.jnca.2020.102696
Shehab NAnagnostopoulos C(2020)Big Data Thinning: Knowledge Discovery from Relevant DataConvergence of Artificial Intelligence and the Internet of Things10.1007/978-3-030-44907-0_11(259-297)Online publication date: 7-May-2020
https://doi.org/10.1007/978-3-030-44907-0_11
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents