
Query-centric regression

Published: 01 February 2022

Abstract

Regression models (RMs), and machine learning (ML) models more generally, aim to offer high prediction accuracy even for unforeseen queries/datasets; this depends on their fundamental ability to generalize. However, a model overfitted to the current DB state may be the one best suited to offer excellent accuracy on it. This overfit-generalize divide bears many practical implications faced by a data analyst. The paper will reveal, shed light on, and quantify this divide using a large number of real-world datasets and a large number of RMs. It will show that different RMs occupy different positions in this divide, with the result that different RMs are better suited to answer queries on different parts of the same dataset (as queries typically target specific data subspaces defined using selection operators on attributes). It will study in detail eight real-life datasets, as well as datasets from the TPC-DS benchmark, and experiment with various dimensionalities therein. It will employ new, appropriate metrics that reveal the performance differences of RMs, and it will substantiate the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.
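The abstract's central claim — that different RMs can be better suited to different data subspaces targeted by range queries — can be sketched concretely. The following is a minimal toy illustration only (an assumed setup, not the paper's actual algorithm, models, or datasets): two regressors are fit globally, and each range query then selects whichever model has the lowest error on the subspace the query's selection predicate defines.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 4000)
# Synthetic target: linear everywhere, with an extra oscillating
# component on x >= 5 only.
y = 2.0 * x + np.where(x >= 5.0, 4.0 * np.sin(3.0 * x), 0.0)

# Model 1: a single global linear least-squares fit.
slope, intercept = np.polyfit(x, y, 1)
models = {"linear": lambda q: slope * q + intercept}

# Model 2: a piecewise-constant fit over 20 equal-width bins -- a crude
# stand-in for a more flexible (e.g. tree-based) regressor.
edges = np.linspace(0.0, 10.0, 21)
which = np.clip(np.digitize(x, edges) - 1, 0, 19)
bin_means = np.array([y[which == b].mean() for b in range(20)])
models["binned"] = lambda q: bin_means[np.clip(np.digitize(q, edges) - 1, 0, 19)]

def query_mse(name, lo, hi):
    """Model error restricted to the data subspace a range query selects."""
    sel = (x >= lo) & (x < hi)
    return float(np.mean((models[name](x[sel]) - y[sel]) ** 2))

def best_model_for_query(lo, hi):
    """Query-centric selection: pick the model with lowest error on this query's subspace."""
    return min(models, key=lambda m: query_mse(m, lo, hi))

# On this synthetic data the global linear fit should be closer to optimal
# on [0, 5), while the flexible model should win on [5, 10).
print(best_model_for_query(0, 5), best_model_for_query(5, 10))
```

The point of the sketch is the mechanism: per-query (subspace-restricted) error evaluation, rather than a single global score, decides which model answers each query.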

Highlights

Studies the impedance mismatch problem between regression models and DBMS analytics.
A query-centric view and solution ensuring near-optimal performance for each query.
New metrics for evaluating query-centric performance of regression models for DBMS.
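The paper's metric definitions are not reproduced on this page. As a hypothetical illustration only of what a query-centric metric can capture, one could measure each model's per-query regret against the best available model for that query (all names and numbers below are invented for the example):

```python
import numpy as np

# errs[m, q]: error of model m on query q (toy numbers, assumed).
errs = np.array([
    [0.10, 0.40, 0.25],   # model A
    [0.30, 0.15, 0.20],   # model B
])
per_query_opt = errs.min(axis=0)   # best achievable error for each query
regret = errs - per_query_opt      # each model's distance from per-query optimal
mean_regret = regret.mean(axis=1)  # average closeness to per-query optimality
worst_regret = regret.max(axis=1)  # worst-case query for each model
```

A globally strong model can still have high worst-case regret, which is exactly the gap a query-centric evaluation exposes.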


Published In

Information Systems, Volume 104, Issue C (February 2022), 467 pages

Publisher

Elsevier Science Ltd.

United Kingdom

Author Tags

  1. Query-centric
  2. Regression for DBMSs
  3. Predictive analytics

Qualifiers

  • Research-article
