
Query-centric regression

Published: 01 February 2022

Abstract

Regression models (RMs), and machine learning (ML) models more generally, aim to offer high prediction accuracy even for unforeseen queries/datasets; this depends on their fundamental ability to generalize. However, a model overfitted to the current DB state may be the one best suited to offer excellent accuracy on it. This overfit-generalize divide bears many practical implications faced by a data analyst. The paper will reveal, shed light on, and quantify this divide using a large number of real-world datasets and a large number of RMs. It will show that different RMs occupy different positions in this divide, with the result that different RMs are better suited to answer queries on different parts of the same dataset (as queries typically target specific data subspaces defined using selection operators on attributes). It will study in detail eight real-life datasets, as well as datasets from the TPC-DS benchmark, and experiment with various dimensionalities therein. It will employ new, appropriate metrics that reveal the performance differences of RMs, and it will substantiate the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.
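The abstract's central claim — that different RMs can be better suited to different data subspaces targeted by range queries — can be sketched concretely. The following is a minimal toy illustration only (an assumed setup, not the paper's actual algorithm, models, or datasets): two regressors are fit globally, and each range query then selects whichever model has the lowest error on the subspace the query's selection predicate defines.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 4000)
# Synthetic target: linear everywhere, with an extra oscillating
# component on x >= 5 only.
y = 2.0 * x + np.where(x >= 5.0, 4.0 * np.sin(3.0 * x), 0.0)

# Model 1: a single global linear least-squares fit.
slope, intercept = np.polyfit(x, y, 1)
models = {"linear": lambda q: slope * q + intercept}

# Model 2: a piecewise-constant fit over 20 equal-width bins -- a crude
# stand-in for a more flexible (e.g. tree-based) regressor.
edges = np.linspace(0.0, 10.0, 21)
which = np.clip(np.digitize(x, edges) - 1, 0, 19)
bin_means = np.array([y[which == b].mean() for b in range(20)])
models["binned"] = lambda q: bin_means[np.clip(np.digitize(q, edges) - 1, 0, 19)]

def query_mse(name, lo, hi):
    """Model error restricted to the data subspace a range query selects."""
    sel = (x >= lo) & (x < hi)
    return float(np.mean((models[name](x[sel]) - y[sel]) ** 2))

def best_model_for_query(lo, hi):
    """Query-centric selection: pick the model with lowest error on this query's subspace."""
    return min(models, key=lambda m: query_mse(m, lo, hi))

# On this synthetic data the global linear fit should be closer to optimal
# on [0, 5), while the flexible model should win on [5, 10).
print(best_model_for_query(0, 5), best_model_for_query(5, 10))
```

The point of the sketch is the mechanism: per-query (subspace-restricted) error evaluation, rather than a single global score, decides which model answers each query.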

Highlights

Studies the impedance mismatch problem between regression models and DBMS analytics.
A query-centric view and solution ensuring near-optimal performance for each query.
New metrics for evaluating query-centric performance of regression models for DBMS.
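The paper's metric definitions are not reproduced on this page. As a hypothetical illustration only of what a query-centric metric can capture, one could measure each model's per-query regret against the best available model for that query (all names and numbers below are invented for the example):

```python
import numpy as np

# errs[m, q]: error of model m on query q (toy numbers, assumed).
errs = np.array([
    [0.10, 0.40, 0.25],   # model A
    [0.30, 0.15, 0.20],   # model B
])
per_query_opt = errs.min(axis=0)   # best achievable error for each query
regret = errs - per_query_opt      # each model's distance from per-query optimal
mean_regret = regret.mean(axis=1)  # average closeness to per-query optimality
worst_regret = regret.max(axis=1)  # worst-case query for each model
```

A globally strong model can still have high worst-case regret, which is exactly the gap a query-centric evaluation exposes.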


Published In

Information Systems, Volume 104, Issue C (February 2022), 467 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Author Tags

  1. Query-centric
  2. Regression for DBMSs
  3. Predictive analytics

Qualifiers

  • Research-article
