skip to main content
research-article

Query-Driven Learning for Predictive Analytics of Data Subspace Cardinality

Published: 29 June 2017 Publication History

Abstract

Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and in query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., for privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodating the well-known selection queries: multi-dimensional range and distance-nearest neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space, by learning the analysts’ access patterns over a data space, (ii) associates query vectors with their corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers a performance that is superior to that of data-driven approaches.

References

[1]
Milton Abramowitz. 1974. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Dover Publications, Incorporated.
[2]
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinlander, Matthias J. Sax, Sebastian Schelter, Mareike Hoger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. The VLDB Journal 23, 6 (December 2014), 939--964.
[3]
Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, and Till Westmann. 2014. AsterixDB: A scalable, open source BDMS. Proceedings of the VLDB Endowment 7, 14 (October 2014), 1905--1916.
[4]
Christos Anagnostopoulos and Peter Triantafillou. 2015a. Learning set cardinality in distance nearest neighbours. In Proceedings of the IEEE International Conference on Data Mining (ICDM’15). 691--696.
[5]
Christos Anagnostopoulos and Peter Triantafillou. 2015b. Learning to accurately COUNT with query-driven predictive analytics. In Proceedings of the IEEE International Conference on Big Data (Big Data’15). 14--23.
[6]
Atanas Atanasov, Madhusudhanan Srinivasan, and Tobias Weinzierl. 2012. Query-driven parallel exploration of large datasets. In Proceedings of the IEEE Symposium on Large Data Analysis and Visualization, (LDAV’12). 23--30.
[7]
Natasha Balac, Tamara B. Sipes, Nicole Wolter, Kenneth Nunes, Robert S. Sinkovits, and Homa Karimabadi. 2013. Large scale predictive analytics for real-time energy management. In Proceedings of the 2013 IEEE International Conference on Big Data. 657--664.
[8]
Stefan Berchtold, Christian Bohm, Daniel A. Keim, and Hans-Peter Kriegel. 1997. A cost model for nearest neighbor search in high-dimensional data space. In Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’97). ACM, New York, NY, 78--86.
[9]
Dimitri P. Bertsekas. 2005. Dynamic Programming and Optimal Control. Volume I. Athena Scientific, Belmont, MA.
[10]
Leon Bottou and Yoshua Bengio. 1994. Convergence properties of the K-means algorithms. In Advances in Neural Information Processing Systems 7, NIPS Conference, Denver, Colorado, USA. 585--592.
[11]
Olivier Bousquet and Leon Bottou. 2008. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.). Curran Associates, Inc., 161--168.
[12]
Gail A. Carpenter and Stephen Grossberg. 1988. The ART of adaptive pattern recognition by a self-organizing neural network. IEEE Computer 21, 3 (1988), 77--88.
[13]
Kaushik Chakrabarti, Minos N. Garofalakis, Rajeev Rastogi, and Kyuseok Shim. 2000. Approximate query processing using wavelets. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB’00). Morgan Kaufmann Publishers, Inc., San Francisco, CA, 111--122.
[14]
Abon Chaudhuri, Tzu-Hsuan Wei, Teng-Yok Lee, Han-Wei Shen, and Tom Peterka. 2014. Efficient range distribution query for visualizing scientific data. In Proceedings of the IEEE Pacific Visualization Symposium, (PacificVis’14). 201--208.
[15]
Yuan Shih Chow, Herbert Ellis Robbins, and David Siegmund. 1971. Great Expectations: The Theory of Optimal Stopping. Houghton Mifflin, Boston.
[16]
Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. 2012. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases 4, 1--3 (January 2012), 1--294.
[17]
D. A. Darling, T. Liggett, and H. M. Taylor. 1972. Optimal stopping for partial sums. The Annals of Mathematical Statistics 43, 4 (1972), 1363--1368.
[18]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107--113.
[19]
Luke J. Gosink, Christoph Garth, John C. Anderson, E. Wes Bethel, and Kenneth I. Joy. 2011. An application of multivariate statistical analysis for query-driven visualization. IEEE Transactions on Visualization and Computer Graphics 17, 3 (2011), 264--275.
[20]
Mihajlo Grbovic and Slobodan Vucetic. 2009. Regression learning vector quantization. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM’09). 788--793.
[21]
Dimitrios Gunopulos, George Kollios, Vassilis J. Tsotras, and Carlotta Domeniconi. 2005. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal 14, 2 (2005), 137--154.
[22]
Gus W. Haggstrom. 1967. Optimal sequential procedures when more than one stop is required. The Annals of Mathematical Statistics 38, 6 (1967), 1618--1626.
[23]
Elena Ikonomovska, Joao Gama, and Savso Dvzeroski. 2010. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery 23, 1 (2010), 128--168.
[24]
John E. Dennis Jr. and Robert B. Schnabel. 1983. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall.
[25]
Alan F. Karr. 2010. Secure statistical analysis of distributed databases, emphasizing what we don’t know. Journal of Privacy and Confidentiality 1, 2 (2010), 197--211.
[26]
Tuevo Kohonen. 1989. Self-Organization and Associative Memory: 3rd Edition. Springer-Verlag New York, Inc., New York, NY.
[27]
Teuvo Kohonen. 2013. Essentials of the self-organizing map. Neural Networks 37 (January 2013), 52--65.
[28]
Jong-Seok Lee, Cheol Hoon Park, and Touradj Ebrahimi. 2013. Theory and Applications of Hybrid Simulated Annealing. Springer Berlin Heidelberg, 395--422.
[29]
Lester E. Dubins and Henry Teicher. 1967. Optimal stopping when the future is discounted. The Annals of Mathematical Statistics 38, 2 (1967), 601--605.
[30]
Shoumei Li and Jinping Zhang. 2005. A general method for convergence theorems of fuzzy set-valued random variables and its applications to martingales and uniform amarts. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 13, 3 (2005), 243--254.
[31]
Moshe Lichman. 2013. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.
[32]
Chieh-Yen Lin, Cheng-Hao Tsai, Ching-Pei Lee, and Chih-Jen Lin. 2014. Large-scale logistic regression and linear support vector machines using spark. In Proceedings of the 2014 IEEE International Conference on Big Data, (Big Data’14), Washington, DC, USA, October 27--30. 519--528.
[33]
Junshui Ma, James Theiler, and Simon Perkins. 2003. Accurate on-line support vector regression. Neural Computing 15, 11 (November 2003), 2683--2703.
[34]
James B. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman (Eds.), Vol. 1. University of California Press, 281--297.
[35]
Scott C. Newton, Surya Pemmaraju, and Sunanda Mitra. 1992. Adaptive fuzzy leader clustering of complex data sets in pattern recognition. IEEE Transactions on Neural Networks 3, 5 (1992), 794--800.
[36]
Frank Olken and Doron Rotem. 1990. Random sampling from database files: A survey. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management (SSDBM’90). Springer-Verlag, London, UK, 92--111.
[37]
Goran Peskir and Albert Shiryaev. 2006. Optimal Stopping and Free-Boundary Problems. Springer, Dordrecht.
[38]
Milos Radovanovic, Alexandros Nanopoulos, and Mirjana Ivanovic. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (December 2010), 2487--2531.
[39]
Timothy T. Rogers and James L. McClelland. 2014. Parallel distributed processing at 25: Further explorations in the microstructure of cognition. Cognitive Science 38 (2014), 1024--1077.
[40]
Kenneth Rose, Eitan Gurewitz, and Geoffrey C. Fox. 1992. Vector quantization by deterministic annealing. IEEE Transactions on Information Theory 38, 4 (1992), 1249--1257.
[41]
David A. Shnidman. 1989. The calculation of the probability of detection and the generalized Marcum Q-function. IEEE Transactions on Information Theory 35, 2 (1989), 389--400.
[42]
Utkarsh Srivastava, Peter J. Haas, Volker Markl, Marcel Kutsch, and Tam Minh Tran. 2006. ISOMER: Consistent histogram construction using query feedback. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06, Atlanta, GA, USA), April 3--8. 39.
[43]
Yin Sun, Árpád Baricz, and Shidong Zhou. 2010. On the monotonicity, log-concavity, and tight bounds of the generalized Marcum and Nuttall Q-functions. IEEE Transactions on Information Theory 56, 3 (2010), 1166--1186.
[44]
Yufei Tao, Christos Faloutsos, and Dimitris Papadias. 2003. The power-method: A comprehensive estimation technique for multi-dimensional queries. In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM’03). ACM, New York, NY, 83--90.
[45]
Hien To, Kuorong Chiang, and Cyrus Shahabi. 2013. Entropy-based histograms for selectivity estimation. In Proceedings of the 22nd ACM International Conference on Information 8 Knowledge Management (CIKM’13). ACM, New York, NY, 1939--1948.
[46]
Vladimir N. Vapnik and Alexey Ya. Chervonenkis. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 2 (1971), 264--280.
[47]
Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC’13). ACM, New York, NY, Article 5, 16 pages.
[48]
Raajay Viswanathan, Prateek Jain, Srivatsan Laxman, and Arvind Arasu. 2011. A learning framework for self-tuning histograms. CoRR abs/1111.7295 (2011). http://arxiv.org/abs/1111.7295.
[49]
Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 1 (March 1985), 37--57.
[50]
Tom White. 2009. Hadoop: The Definitive Guide (1st ed.). O’Reilly Media, Inc.
[51]
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12). USENIX Association, Berkeley, CA, 2.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data
ACM Transactions on Knowledge Discovery from Data  Volume 11, Issue 4
Special Issue on KDD 2016 and Regular Papers
November 2017
419 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3119906
  • Editor:
  • Jie Tang
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2017
Accepted: 01 March 2017
Revised: 01 December 2016
Received: 01 May 2016
Published in TKDD Volume 11, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Predictive analytics
  2. analytics selection queries
  3. data subspace exploration
  4. optimal stopping theory
  5. predictive learning
  6. vector regression quantization

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)16
  • Downloads (Last 6 weeks)5
Reflects downloads up to 06 Jan 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media