A Relational Framework for Classifier Engineering

Published: 30 October 2018

Abstract

In the design of analytical procedures and machine learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this article, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database’s modeling and understanding of data and queries and by deploying the well-studied principles of database management. As a first step, we demonstrate the usefulness of this framework by formally defining three key algorithmic challenges. The first challenge is that of separability, which is the problem of determining the existence of feature queries that agree with the training examples. The second is that of evaluating the VC dimension of the model class with respect to a given sequence of feature queries. The third challenge is identifiability, which is the task of testing for a property of independence among features that are represented as database queries. We give preliminary results on these challenges for the case where features are defined by means of conjunctive queries, and, in particular, we study the implication of various traditional syntactic restrictions on the inherent computational complexity.
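
To make the separability challenge concrete, the following is a minimal sketch, not the article's algorithm. It assumes a toy database, three hand-written feature queries (each a conjunctive query with one free variable, encoded as a Python predicate), and a linear classifier over the resulting ±1 feature vectors. All relation names, entity names, and function names are hypothetical. The perceptron search is a stand-in for a real decision procedure: if it finds no separator within `max_epochs`, that does not prove that none exists.

```python
# Toy database: relation name -> set of tuples (all names are hypothetical).
DB = {
    "Person": {("ann",), ("bob",), ("cid",), ("dee",)},
    "Follows": {("ann", "bob"), ("cid", "bob"), ("cid", "ann")},
    "Verified": {("bob",), ("ann",)},
}


def follows_someone(db, x):
    # q1(x) :- Follows(x, y)
    return any(a == x for a, _ in db["Follows"])


def follows_verified(db, x):
    # q2(x) :- Follows(x, y), Verified(y)
    return any(a == x and (b,) in db["Verified"] for a, b in db["Follows"])


def is_verified(db, x):
    # q3(x) :- Verified(x)
    return (x,) in db["Verified"]


def feature_vector(db, queries, x):
    """Evaluate each feature query on entity x; encode true/false as +1/-1."""
    return tuple(1 if q(db, x) else -1 for q in queries)


def find_linear_separator(vectors, labels, max_epochs=1000):
    """Perceptron search for (w, b) with label * (w.v + b) > 0 on every example.

    Returns None if no separator is found within max_epochs; this is a
    heuristic cutoff, not a proof of non-separability.
    """
    w, b = [0] * len(vectors[0]), 0
    for _ in range(max_epochs):
        mistakes = 0
        for v, y in zip(vectors, labels):
            if y * (sum(wi * vi for wi, vi in zip(w, v)) + b) <= 0:
                w = [wi + y * vi for wi, vi in zip(w, v)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


def separable_with(db, queries, examples):
    """Check whether the given feature queries admit a linear classifier
    that agrees with every labeled example (entity, label in {+1, -1})."""
    vectors = [feature_vector(db, queries, x) for x, _ in examples]
    labels = [y for _, y in examples]
    return find_linear_separator(vectors, labels)


# Labeled training examples: entity -> class.
EXAMPLES = [("ann", 1), ("bob", -1), ("cid", 1), ("dee", -1)]

print(separable_with(DB, [follows_someone, follows_verified, is_verified], EXAMPLES))
print(separable_with(DB, [is_verified], EXAMPLES))
```

With all three feature queries the examples can be separated, because the positive entities are exactly those that follow someone. With `is_verified` alone, `ann` and `bob` receive the same feature vector but carry different labels, so no classifier over that single feature can agree with the training examples and the search returns `None`.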

Published In

ACM Transactions on Database Systems, Volume 43, Issue 3
Best of PODS 2017, Best of ICDT 2017 and Regular Papers
September 2018
164 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3284689
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 October 2018
Accepted: 01 August 2018
Revised: 01 May 2018
Received: 01 October 2017
Published in TODS Volume 43, Issue 3

Author Tags

  1. Feature engineering
  2. classifiers
  3. conjunctive queries
  4. machine learning
  5. relational databases

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • DARPA's projects XDATA, DEFT, MEMEX, and SIMPLEX
