research-article

A Relational Framework for Classifier Engineering

Authors:

Benny Kimelfeld,

Christopher Ré

ACM Transactions on Database Systems (TODS), Volume 43, Issue 3

Article No.: 11, Pages 1 - 36

https://doi.org/10.1145/3268931

Published: 30 October 2018

Abstract

In the design of analytical procedures and machine learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this article, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database’s modeling and understanding of data and queries and by deploying the well-studied principles of database management. As a first step, we demonstrate the usefulness of this framework by formally defining three key algorithmic challenges. The first challenge is that of separability, which is the problem of determining the existence of feature queries that agree with the training examples. The second is that of evaluating the VC dimension of the model class with respect to a given sequence of feature queries. The third challenge is identifiability, which is the task of testing for a property of independence among features that are represented as database queries. We give preliminary results on these challenges for the case where features are defined by means of conjunctive queries, and, in particular, we study the implication of various traditional syntactic restrictions on the inherent computational complexity.
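
To make the separability challenge concrete, the following is a minimal sketch and not the article's algorithm. It fixes two candidate feature queries, written as conjunctive queries over a toy database, and searches for a linear classifier that agrees with the labeled entities. The relation names, entities, queries, and the choice of a perceptron search over linear classifiers are all assumptions made for illustration; the abstract does not specify them. The search gives a certificate of separability when it succeeds and is inconclusive when it gives up.

```python
# Toy relational database: relation name -> set of tuples.
# All relation names, entities, and queries here are hypothetical.
DB = {
    "Follows": {("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("dan", "ann")},
    "Verified": {("ann",), ("bob",)},
}


def holds(db, atoms, binding):
    """True if the conjunction of atoms is satisfiable in db, extending binding.

    An atom is (relation, terms). A term starting with '?' is a variable,
    anything else is a constant. Arities are assumed to match the relation.
    """
    if not atoms:
        return True
    (rel, terms), rest = atoms[0], atoms[1:]
    for fact in db[rel]:
        new = dict(binding)
        if all(
            new.setdefault(t, v) == v if t.startswith("?") else t == v
            for t, v in zip(terms, fact)
        ) and holds(db, rest, new):
            return True
    return False


# Each feature query is a conjunctive query with one free variable ?x.
FEATURE_QUERIES = [
    [("Follows", ("?x", "?y")), ("Verified", ("?y",))],  # x follows a verified entity
    [("Follows", ("?y", "?x"))],                          # x has at least one follower
]


def feature_vector(db, queries, entity):
    """Evaluate every feature query on one entity, giving a 0/1 vector."""
    return tuple(int(holds(db, q, {"?x": entity})) for q in queries)


def find_separator(vectors, labels, max_epochs=1000):
    """Perceptron search for (w, b) with label * (w.x + b) > 0 on every example.

    Returns None if no separator is found within max_epochs. That outcome
    is inconclusive in general, so this is a witness search, not a decision
    procedure.
    """
    w, b = [0] * len(vectors[0]), 0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(vectors, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                updated = True
        if not updated:
            return w, b
    return None


entities = ["ann", "bob", "cat", "dan"]
vectors = [feature_vector(DB, FEATURE_QUERIES, e) for e in entities]

# Labels that these two features can separate.
separable = find_separator(vectors, [+1, -1, +1, -1])

# "ann" and "cat" share a feature vector but get different labels here,
# so no classifier over these two features can agree with the examples.
conflicting = find_separator(vectors, [+1, -1, -1, -1], max_epochs=50)
```

In the second labeling, two entities with identical feature vectors carry opposite labels, so these particular queries cannot agree with the training examples under any classifier. The problem posed in the abstract is harder than this check, since it asks whether any suitable feature queries exist at all rather than testing one fixed candidate set.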

Published In

ACM Transactions on Database Systems Volume 43, Issue 3

Best of PODS 2017, Best of ICDT 2017 and Regular Papers

September 2018

164 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/3284689

Editor:
Christian S. Jensen
Aalborg University, Denmark

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 October 2018

Accepted: 01 August 2018

Revised: 01 May 2018

Received: 01 October 2017

Published in TODS Volume 43, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

DEFT
MEMEX and SIMPLEX
DARPA's projects XDATA
