PGMJoins: Random Join Sampling with Graphical Models

Authors:

Ali Mohammadi Shanghooshabad,

Peter Triantafillou

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 1610 - 1622

https://doi.org/10.1145/3448016.3457302

Published: 18 June 2021

Abstract

Modern databases face formidable challenges when called on to join several massive tables. Joins, especially many-to-many joins, are very time- and resource-consuming; join results can be too big to keep in memory; and performing analytics or learning tasks over them is costly in time, resources, and money (in the cloud). Although random sampling is a promising way to mitigate these problems, the current state of the art leaves considerable room for improvement. This paper contributes a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models (PGMs) to derive provably random samples of the join result for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins. PGMJoins contributes optimizations both for deriving the structure of the graph and for PGM inference. It also contributes a novel Sum-Product Message Passing Algorithm (SP-MPA) for efficiently drawing a uniform sample from the joint distribution (the join result), and a novel way to deal with cyclic joins. Despite the use of PGMs, the learned joint distribution is not approximated, and the uniform samples are drawn from the true distribution. Experiments using queries and datasets from TPC-H, JOB, TPC-DS, and Twitter show that PGMJoins outperforms the state of the art by 2X-28X.
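To make the idea of uniform sampling via message passing concrete, the following is a minimal Python sketch for the simplest case only: an acyclic chain join. It is not the paper's SP-MPA and does not handle cyclic joins. The function names `suffix_counts` and `sample_join`, and the representation of each table as a list of `(left_key, right_key)` pairs, are assumptions made for illustration. Each table passes a message to its predecessor saying how many join results each key value can be completed into; sampling then walks forward, choosing rows in proportion to those counts, so every join result is drawn with equal probability.

```python
import random
from collections import defaultdict


def suffix_counts(tables):
    """For each row, count the join results that extend it through later tables.

    `tables` is a chain R1(a, b), R2(b, c), ...: each table is a list of
    (left_key, right_key) pairs, and a row joins with rows of the next table
    whose left_key equals its right_key. (Hypothetical representation.)
    """
    counts = [None] * len(tables)
    counts[-1] = [1] * len(tables[-1])  # a last-table row completes exactly one result
    for i in range(len(tables) - 2, -1, -1):
        # Message from table i+1 to table i: total completions per join-key value.
        message = defaultdict(int)
        for (left, _), c in zip(tables[i + 1], counts[i + 1]):
            message[left] += c
        counts[i] = [message.get(right, 0) for (_, right) in tables[i]]
    return counts


def sample_join(tables, counts, rng=random):
    """Draw one join result uniformly at random, or None if the join is empty."""
    if sum(counts[0]) == 0:
        return None
    # Pick the first row with probability proportional to its completion count.
    (row,) = rng.choices(tables[0], weights=counts[0])
    result = [row]
    for i in range(1, len(tables)):
        key = result[-1][1]
        # Rows of table i that join with the row chosen from table i-1.
        matches = [j for j, (left, _) in enumerate(tables[i]) if left == key]
        (j,) = rng.choices(matches, weights=[counts[i][m] for m in matches])
        result.append(tables[i][j])
    return tuple(result)
```

The probabilities telescope: the first row is chosen with probability (its count) / (total join size), and each later row with probability (its count) / (the previous row's count), so every complete result has probability 1 / (total join size). Rows are treated as distinct by position, so duplicate rows contribute separately, as under bag semantics.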

Supplementary Material

MP4 File (3448016.3457302.mp4)

Modern databases face formidable challenges when called on to join several big tables. Such joins are very time- and resource-consuming, join results are too big to keep in memory, and performing analytics or learning tasks over them is costly in time, resources, and money (in the cloud). Although random sampling is a promising way to mitigate these problems, the current state of the art [43] leaves considerable room for improvement, and the general solution, besides being inefficient, also seems complex and difficult to follow. This paper contributes a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models to derive provably random samples for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins. PGMJoins also contributes a novel Message Passing Protocol (MPP) for drawing a uniform sample from the joint distribution (the join result) and a novel way to deal with cyclic joins. PGMJoins is shown to significantly outperform the state of the art (by ca. 2X-26X) using queries and datasets from TPC-H, TPC-DS, and Twitter.

References

[1] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. 1999. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. 275-286.
