DOI: 10.1145/3196959.3196990
research-article

Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems

Published: 27 May 2018

Abstract

Worst-case optimal join algorithms are the class of join algorithms whose runtime matches the worst-case output size of a given join query. While the first provably worst-case optimal join algorithm was discovered relatively recently, the techniques and results surrounding these algorithms grow out of decades of research from a wide range of areas, intimately connecting graph theory, algorithms, information theory, constraint satisfaction, database theory, and geometric inequalities. These ideas are not just paperware: in addition to academic project implementations, two variations of such algorithms are the workhorse join algorithms of commercial database and data analytics engines. This paper aims to be a brief introduction to the design and analysis of worst-case optimal join algorithms. We discuss the key techniques for proving runtime and output size bounds. We particularly focus on the fascinating connection between join algorithms and information-theoretic inequalities, and the idea of how one can turn a proof into an algorithm. Finally, we conclude with a representative list of fundamental open problems in this area.
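
To make the notion of "runtime matching the worst-case output size" concrete, consider the triangle query over three binary relations R(A, B), S(B, C), T(A, C). The bound of Atserias, Grohe, and Marx says the output has at most sqrt(|R| * |S| * |T|) tuples. The sketch below is only an illustration in the spirit of the variable-at-a-time algorithms this line of work studies; it is not taken from the paper, and the names `index` and `triangle_join` are assumptions made for the example.

```python
from collections import defaultdict


def index(pairs):
    """Map each first component to the set of second components."""
    idx = defaultdict(set)
    for x, y in pairs:
        idx[x].add(y)
    return idx


def triangle_join(R, S, T):
    """Enumerate (a, b, c) with (a, b) in R, (b, c) in S, (a, c) in T.

    Variables are bound one at a time (a, then b, then c). The last
    intersection is scanned from its smaller side, which is the step
    that avoids materializing a large pairwise intermediate result.
    """
    r_idx, s_idx, t_idx = index(R), index(S), index(T)
    out = []
    for a in r_idx.keys() & t_idx.keys():   # a must occur in both R and T
        for b in r_idx[a]:                  # (a, b) is in R
            if b not in s_idx:              # b must also occur in S
                continue
            small, large = sorted((s_idx[b], t_idx[a]), key=len)
            for c in small:                 # iterate the smaller candidate set
                if c in large:
                    out.append((a, b, c))
    return out


# Hypothetical toy input: every pair i < j over {0, 1, 2, 3}.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
triangles = sorted(triangle_join(edges, edges, edges))
```

On this toy input the result is the four increasing triples over {0, 1, 2, 3}, comfortably below the bound sqrt(6 * 6 * 6). A plan that first joins two relations pairwise can, on skewed inputs, build an intermediate result much larger than the final output, which is the gap worst-case optimal algorithms are designed to close.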

Published In

PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
May 2018
462 pages
ISBN: 9781450347068
DOI: 10.1145/3196959

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. entropy
  2. inequality
  3. join algorithm
  4. polymatroid
  5. worst-case optimal

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '18