skip to main content
research-article

DIFF: a relational interface for large-scale data explanation

Published: 01 December 2018 Publication History

Abstract

A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root cause analysis, today's explanation engines are designed as standalone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications, and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.

References

[1]
S. Abiteboul, R. Hull, and V. Vianu. Foundations of databases: the logical level. Addison-Wesley Longman Publishing Co., Inc., 1995.
[2]
R. Agarwal, R. Srikant, et al. Fast algorithms for mining association rules. In VLDB, pages 487--499, 1994.
[3]
M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis, D. Kumar, C. Lever, Z. Ma, J. Mason, D. Menscher, C. Seaman, N. Sullivan, K. Thomas, and Y. Zhou. Understanding the mirai botnet. In USENIX Security, 2017.
[4]
M. Armbrust et al. Spark sql: Relational data processing in spark. In SIGMOD, pages 1383--1394. ACM, 2015.
[5]
R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD, volume 29, pages 261--272. ACM, 2000.
[6]
J. Ayres et al. Sequential pattern mining using a bitmap representation. In KDD, pages 429--435. ACM, 2002.
[7]
S. Babu, P. Bizarro, and D. DeWitt. Proactive re-optimization. In SIGMOD, pages 107--118. ACM, 2005.
[8]
P. Bailis et al. Macrobase: Prioritizing attention in fast data. In SIGMOD, pages 541--556. ACM, 2017.
[9]
P. Bailis et al. Prioritizing attention in fast data: Principles and promise. CIDR Google Scholar, 2017.
[10]
E. Baralis, T. Cerquitelli, and S. Chiusano. Index support for frequent itemset mining in a relational dbms. In ICDE, pages 754--765. IEEE, 2005.
[11]
E. Baralis, T. Cerquitelli, and S. Chiusano. Imine: Index support for item set mining. IEEE Transactions on Knowledge and Data Engineering, 21(4):493--506, 2009.
[12]
R. G. Baraniuk. Compressive sensing {lecture notes}. IEEE signal processing magazine, 24(4):118--121, 2007.
[13]
Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165--1188, 2001.
[14]
M. Bittorf et al. Impala: A modern, open-source sql engine for hadoop. In CIDR, 2015.
[15]
D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In ICDE, pages 443--452. IEEE, 2001.
[16]
S. Chambi et al. Better bitmap performance with roaring bitmaps. Software: practice and experience, 46(5):709--719, 2016.
[17]
S. Chambi et al. Optimizing druid with roaring bitmaps. In IDEAS, pages 77--86. ACM, 2016.
[18]
S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34--43. ACM, 1998.
[19]
L. Chen et al. Towards linear algebra over normalized data. PVLDB, 10(11):1214--1225, 2017.
[20]
J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56:74--80, 2013.
[21]
A. Deshpande et al. Adaptive query processing. Foundations and Trends in Databases, 1(1):1--140, 2007.
[22]
Z. Durumeric et al. The matter of heartbleed. In IMC, pages 475--488. ACM, 2014.
[23]
Z. Durumeric et al. A search engine backed by Internet-wide scanning. In SIGSAC, pages 542--553. ACM, 2015.
[24]
R. Fagin et al. Efficient implementation of large-scale multi-structural databases. In VLDB, pages 958--969. VLDB Endowment, 2005.
[25]
R. Fagin et al. Multi-structural databases. In PODS, pages 184--195. ACM, 2005.
[26]
W. Fang et al. Frequent itemset mining on graphics processors. In DaMoN, pages 34--42. ACM, 2009.
[27]
P. Fournier-Viger et al. The spmf open-source data mining library version 2. In Joint European conference on machine learning and knowledge discovery in databases, pages 36--40. Springer, 2016.
[28]
G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In ICDE, pages 209--218. IEEE, 1993.
[29]
J. Gray et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery, 1(1):29--53, 1997.
[30]
A. Greenberg et al. The cost of a cloud: research problems in data center networks. ACM SIGCOMM computer communication review, 39(1):68--73, 2008.
[31]
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157--1182, 2003.
[32]
M. A. Hall. Correlation-based feature selection of discrete and numeric class machine learning. 2000.
[33]
J. M. Hellerstein et al. Architecture of a database system. Foundations and Trends® in Databases, 1(2):141--259, 2007.
[34]
J. M. Hellerstein and M. Stonebraker. Readings in database systems. 2005.
[35]
S. C. Hoi et al. Online feature selection for mining big data. In BigMine, pages 93--100. ACM, 2012.
[36]
IDC. The digital universe of opportunities: Rich data and the increasing value of the internet of things, 2014. http://www.emc.com/leadership/digital-universe/.
[37]
I. F. Ilyas et al. Cords: automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647--658. ACM, 2004.
[38]
Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results, volume 20. ACM, 1991.
[39]
N. Khoussainova, M. Balazinska, and D. Suciu. Perfxplain: Debugging mapreduce job performance. PVLDB, 5(7):598--609, 2012.
[40]
R. Kimball and M. Ross. The data warehouse toolkit: the complete guide to dimensional modeling. John Wiley & Sons, 2011.
[41]
P. Konda et al. Feature selection in enterprise analytics: a demonstration using an r-based data analytics system. PVLDB, 6(12):1306--1309, 2013.
[42]
A. Kumar. Learning Over Joins. PhD thesis, The University of Wisconsin-Madison, 2016.
[43]
A. Kumar et al. To join or not to join?: Thinking twice about joins before feature selection. In SIGMOD, pages 19--34. ACM, 2016.
[44]
A. Kumar, J. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1969--1984. ACM, 2015.
[45]
A. Lamb et al. The vertica analytic database: C-store 7 years later. VLDB, 5(12):1790--1801, 2012.
[46]
J. Leskovec et al. Mining of Massive Datasets. Cambridge university press, 2014.
[47]
H. Li et al. Pfp: parallel fp-growth for query recommendation. In RecSys, pages 107--114. ACM, 2008.
[48]
J. Li et al. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2017.
[49]
A. Meliou, S. Roy, and D. Suciu. Causality and explanations in databases. PVLDB, 7(13):1715--1716, 2014.
[50]
S. Melnik et al. Dremel: interactive analysis of web-scale datasets. PVLDB, 3(1--2):330--339, 2010.
[51]
X. Meng et al. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1):1235--1241, 2016.
[52]
T. Neumann and B. Radke. Adaptive optimization of very large join queries. 2018.
[53]
H. Q. Ngo et al. Worst-case optimal join algorithms. Journal of the ACM (JACM), 65(3):16, 2018.
[54]
P. O'Neil and D. Quass. Improved query performance with variant indexes. In SIGMOD, volume 26, pages 38--49. ACM, 1997.
[55]
A. Pagh and R. Pagh. Scalable computation of acyclic joins. In PODS, pages 225--232. ACM, 2006.
[56]
E. Rounds. A combined nonparametric approach to feature selection and binary decision tree design. Pattern Recognition, 12(5):313--317, 1980.
[57]
S. Roy et al. Perfaugur: Robust diagnostics for performance anomalies in cloud services. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 1167--1178. IEEE, 2015.
[58]
S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579--1590. ACM, 2014.
[59]
G. Rupert Jr et al. Simultaneous statistical inference. Springer Science & Business Media, 2012.
[60]
Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. bioinformatics, 23(19):2507--2517, 2007.
[61]
S. Schuh, X. Chen, and J. Dittrich. An experimental comparison of thirteen relational equi-joins in main memory. In SIGMOD, pages 1961--1976. ACM, 2016.
[62]
P. G. Selinger et al. Access path selection in a relational database management system. In Readings in Artificial Intelligence and Databases, pages 511--522. Elsevier, 1988.
[63]
X. Shang, K.-U. Sattler, and I. Geist. Sql based frequent pattern mining with fp-growth. In Applications of Declarative Programming and Knowledge Management, pages 32--46. Springer, 2005.
[64]
M. Stonebraker et al. C-store: a column-oriented dbms. In VLDB, pages 553--564. VLDB Endowment, 2005.
[65]
X. Wang et al. Data x-ray: A diagnostic tool for data errors. In SIGMOD, pages 1231--1245. ACM, 2015.
[66]
D. E. Willard. Applications of range query theory to relational data base join and selection operations. journal of computer and system sciences, 52(1):157--169, 1996.
[67]
E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013.
[68]
F. Yang et al. Druid: A real-time analytical data store. In SIGMOD, pages 157--168. ACM, 2014.
[69]
D. Y. Yoon, N. Niu, and B. Mozafari. Dbsherlock: A performance diagnostic tool for transactional databases. In SIGMOD, pages 1599--1614. ACM, 2016.
[70]
M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2. USENIX Association, 2012.
[71]
F. Zhang, Y. Zhang, and J. Bakos. Gpapriori: Gpu-accelerated frequent itemset mining. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 590--594. IEEE, 2011.

Cited By

View all

Index Terms

  1. DIFF: a relational interface for large-scale data explanation
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 12, Issue 4
      December 2018
      140 pages
      ISSN:2150-8097
      Issue’s Table of Contents

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 December 2018
      Published in PVLDB Volume 12, Issue 4

      Qualifiers

      • Research-article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)26
      • Downloads (Last 6 weeks)3
      Reflects downloads up to 14 Sep 2024

      Other Metrics

      Citations

      Cited By

      View all

      View Options

      Get Access

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media