PGMJoins: Random Join Sampling with Graphical Models

Published: 18 June 2021

Abstract

Modern databases face formidable challenges when called to join (several) massive tables. Joins (especially many-to-many joins) are very time- and resource-consuming, join results can be too big to keep in memory, and performing analytics/learning tasks over them costs dearly in terms of time, resources, and money (in the cloud). Moreover, although random sampling is a promising idea to mitigate the above problems, the current state of the art leaves much room for improvement. With this paper we contribute a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models (PGMs) to derive provably random samples of the join result for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins. PGMJoins contributes optimizations both for deriving the structure of the graph and for PGM inference. It also contributes a novel Sum-Product Message Passing Algorithm (SP-MPA) to draw a uniform sample of the joint distribution (join result) efficiently, and a novel way to deal with cyclic joins. Despite the use of PGMs, the learned joint distribution is not approximated, and the uniform samples are drawn from the true distribution. Our experiments using queries and datasets from TPC-H, JOB, TPC-DS, and Twitter show that PGMJoins outperforms the state of the art (by 2X-28X).
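The general idea of drawing a uniform sample of a join result by passing count messages between tables can be sketched as follows. This is only a minimal illustration for an acyclic chain join, not the paper's SP-MPA: the function names `build_weights` and `sample_join`, the row-as-dict layout, and the `keys` parameter are all assumptions made for this sketch, and cyclic joins are not handled.

```python
import random
from collections import defaultdict


def build_weights(tables, keys):
    """Backward pass over a chain join tables[0] JOIN tables[1] JOIN ...

    tables: list of tables, each a list of dict rows (hypothetical layout).
    keys[i]: name of the attribute joining tables[i] and tables[i + 1].
    Returns weights[i][j] = number of join results that extend row j of
    tables[i] through all tables to its right.
    """
    n = len(tables)
    weights = [None] * n
    weights[n - 1] = [1] * len(tables[n - 1])
    for i in range(n - 2, -1, -1):
        key = keys[i]
        # Message from table i + 1: summed weight per join-key value.
        message = defaultdict(int)
        for row, w in zip(tables[i + 1], weights[i + 1]):
            message[row[key]] += w
        weights[i] = [message.get(row[key], 0) for row in tables[i]]
    return weights


def sample_join(tables, keys, weights, rng=random):
    """Forward pass: draw one join result uniformly at random.

    Each row is picked with probability proportional to its weight, so the
    product of the step probabilities is 1 / (join size) for every result.
    Returns None when the join is empty.
    """
    if sum(weights[0]) == 0:
        return None
    idx = rng.choices(range(len(tables[0])), weights=weights[0])[0]
    result = [tables[0][idx]]
    for i in range(1, len(tables)):
        key = keys[i - 1]
        value = result[-1][key]
        # A linear scan keeps the sketch short; an index would be used in practice.
        candidates = [j for j, row in enumerate(tables[i]) if row[key] == value]
        idx = rng.choices(candidates, weights=[weights[i][j] for j in candidates])[0]
        result.append(tables[i][idx])
    return result
```

In this sketch, `sum(weights[0])` equals the size of the join result, so no join tuple is ever materialized before sampling and no sample is rejected.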

Supplementary Material

MP4 File (3448016.3457302.mp4)
Modern databases face formidable challenges when called to join (several) big tables. Such joins are very time- and resource-consuming, join results are too big to keep in memory, and performing analytics/learning tasks over them costs dearly in terms of time, resources, and money (in the cloud). Moreover, although random sampling is a promising idea to mitigate the above problems, the current state of the art [43] leaves much room for improvement, and the general solution, besides being inefficient, also seems complex and difficult to follow. With this paper we contribute a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models to derive provably random samples for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins. PGMJoins also contributes a novel Message Passing Protocol (MPP) to draw a uniform sample of the joint distribution (join result) and a novel way to deal with cyclic joins. PGMJoins is shown to significantly outperform the state of the art (by up to ca. 2X-26X) using queries and datasets from TPC-H, TPC-DS, and Twitter.

References

[1]
Swarup Acharya, Phillip B Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. 1999. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data. 275--286.
[2]
Christos Anagnostopoulos and Peter Triantafillou. 2015. Learning set cardinality in distance nearest neighbours. In 2015 IEEE international conference on data mining. IEEE, 691--696.
[3]
Christos Anagnostopoulos and Peter Triantafillou. 2017a. Efficient scalable accurate regression queries in in-dbms analytics. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 559--570.
[4]
Christos Anagnostopoulos and Peter Triantafillou. 2017b. Query-driven learning for predictive analytics of data subspace cardinality. ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 11, 4 (2017), 1--46.
[5]
Christopher M Bishop. 2006. Pattern recognition and machine learning. Springer.
[6]
Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, P Krishna Gummadi, et al. 2010. Measuring user influence in Twitter: The million follower fallacy. ICWSM, Vol. 10, 10--17 (2010), 30.
[7]
Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. 1999. On random sampling over joins. ACM SIGMOD Record, Vol. 28, 2 (1999), 263--274.
[8]
C Chow and Cong Liu. 1968. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, Vol. 14, 3 (1968), 462--467.
[9]
Robert G Cowell, Philip Dawid, Steffen L Lauritzen, and David J Spiegelhalter. 2006. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks. Springer Science & Business Media.
[10]
Amol Deshpande and Sunita Sarawagi. 2007. Probabilistic graphical models and their role in databases. In Proceedings of the 33rd international conference on very large data bases. 1435--1436.
[11]
Kamel Garrouch and Mohamed Nazih Omri. 2017. Bayesian network based information retrieval model. In 2017 International Conference on High Performance Computing & Simulation (HPCS). IEEE, 193--200.
[12]
Walter R Gilks and Pascal Wild. 1992. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics), Vol. 41, 2 (1992), 337--348.
[13]
Peter J Haas. 1997. Large-sample and deterministic confidence intervals for online aggregation. In Proceedings. Ninth International Conference on Scientific and Statistical Database Management (Cat. No. 97TB100150). IEEE, 51--62.
[14]
Peter J Haas and Joseph M Hellerstein. 1999. Ripple joins for online aggregation. ACM SIGMOD Record, Vol. 28, 2 (1999), 287--298.
[15]
Marios Hadjieleftheriou, Xiaohui Yu, Nikos Koudas, and Divesh Srivastava. 2008. Hashed samples: selectivity estimators for set similarity selection queries. Proceedings of the VLDB Endowment, Vol. 1, 1.
[16]
W Keith Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. (1970).
[17]
Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2019. DeepDB: Learn from Data, not from Queries! arXiv preprint arXiv:1909.00607 (2019).
[18]
Szymon Jaroszewicz and Tobias Scheffer. 2005. Fast discovery of unexpected patterns in data, relative to a bayesian network. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. 118--127.
[19]
Chris Jermaine, Subramanian Arumugam, Abhijit Pol, and Alin Dobra. 2008. Scalable approximate query processing with the DBO engine. ACM Transactions on Database Systems (TODS), Vol. 33, 4 (2008), 1--54.
[20]
Robert E Kass, Bradley P Carlin, Andrew Gelman, and Radford M Neal. 1998. Markov chain Monte Carlo in practice: a roundtable discussion. The American Statistician, Vol. 52, 2 (1998), 93--100.
[21]
Daphne Koller and Nir Friedman. 2009. Probabilistic graphical models: principles and techniques. MIT Press.
[22]
Arun Kumar, Jeffrey Naughton, and Jignesh M Patel. 2015. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1969--1984.
[23]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment, Vol. 9, 3 (2015), 204--215.
[24]
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data. 615--629.
[25]
Qingzhi Ma and Peter Triantafillou. 2019. Dbest: Revisiting approximate query processing engines with machine learning models. In Proceedings of the 2019 International Conference on Management of Data. 1553--1570.
[26]
Qingzhi Ma and Peter Triantafillou. 2021. Learned approximate query processing: Make it Light, Accurate, and Fast. In Proceedings of the 2021 International Conference on Innovative Data Systems (CIDR).
[27]
Frank J Massey Jr. 1951. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American statistical Association, Vol. 46, 253 (1951), 68--78.
[28]
Frank Olken. 1993. Random sampling from databases. Ph.D. Dissertation. University of California, Berkeley.
[29]
Yongjoo Park, Barzan Mozafari, Joseph Sorenson, and Junhao Wang. 2018. VerdictDB: universalizing approximate query processing. In Proceedings of the 2018 International Conference on Management of Data. ACM, 1461--1476.
[30]
Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, and Barzan Mozafari. 2017. Database learning: Toward a database that becomes smarter every time. In Proceedings of the 2017 ACM International Conference on Management of Data. 587--602.
[31]
Judea Pearl. 1982. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science.
[32]
Marco Ramoni and Paola Sebastiani. 1997. Learning Bayesian networks from incomplete databases. In UAI, Vol. 97. 401--408.
[33]
Christian P Robert and George Casella. 1999. The Metropolis-Hastings Algorithm. In Monte Carlo Statistical Methods. Springer, 231--283.
[34]
Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. 2016. Learning linear regression models over factorized joins. In Proceedings of the 2016 International Conference on Management of Data. 3--18.
[35]
Sameer Singh and Thore Graepel. 2012. Compiling relational database schemata into probabilistic graphical models. arXiv preprint arXiv:1212.0967 (2012).
[36]
Joe Suzuki. 1993. A construction of Bayesian networks from databases based on an MDL principle. In Uncertainty in Artificial Intelligence. Elsevier, 266--273.
[37]
Saravanan Thirumuruganathan, Shohedul Hasan, Nick Koudas, and Gautam Das. 2019. Approximate query processing using deep generative models. arXiv preprint arXiv:1903.10000 (2019).
[38]
Kostas Tzoumas, Amol Deshpande, and Christian S Jensen. 2011. Lightweight graphical models for selectivity estimation without independence assumptions. Proceedings of the VLDB Endowment, Vol. 4, 11 (2011), 852--863.
[39]
Kostas Tzoumas, Amol Deshpande, and Christian S Jensen. 2013. Efficiently adapting graphical models for selectivity estimation. The VLDB Journal, Vol. 22, 1 (2013), 3--27.
[40]
Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M Hellerstein. 2008. BayesStore: managing large, uncertain data repositories with probabilistic graphical models. Proceedings of the VLDB Endowment, Vol. 1, 1 (2008), 340--351.
[41]
Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2020. NeuroCard: one cardinality estimator for all tables. Proceedings of the VLDB Endowment, Vol. 14, 1 (2020), 61--73.
[42]
Nevin L Zhang and David Poole. 1994. A simple approach to Bayesian network computations. In Proc. of the Tenth Canadian Conference on Artificial Intelligence.
[43]
Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In Proceedings of the 2018 International Conference on Management of Data. 1525--1539.

      Published In

      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021
      2969 pages
      ISBN:9781450383431
      DOI:10.1145/3448016

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. PGMs for data management
      2. join queries
      3. random sampling

      Qualifiers

      • Research-article

      Funding Sources

      • EPSRC (Engineering and Physical Sciences Research Council)

      Conference

      SIGMOD/PODS '21

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%
