research-article

Open access

Bring Your Own Data Structures to Datalog

Authors: Arash Sahebolamri, Langston Barrett, Kristopher Micinski

Proceedings of the ACM on Programming Languages, Volume 7, Issue OOPSLA2

Article No.: 264, Pages 1198 - 1223

https://doi.org/10.1145/3622840

Published: 16 October 2023

Abstract

The restricted logic programming language Datalog has become a popular implementation target for deductive-analytic workloads including social-media analytics and program analysis. Modern Datalog engines compile Datalog rules to joins over explicit representations of relations—often B-trees or hash maps. While these modern engines have enabled high scalability in many application domains, they have a crucial weakness: achieving the desired algorithmic complexity may be impossible due to representation-imposed overhead of the engine’s data structures. In this paper, we present the "Bring Your Own Data Structures" (Byods) approach, in the form of a DSL embedded in Rust. Using Byods, an engineer writes logical rules which are implicitly parametric on the concrete data structure representation; our implementation provides an interface to enable "bringing their own" data structures to represent relations, which harmoniously interact with code generated by our compiler (implemented as Rust procedural macros). We formalize the semantics of Byods as an extension of Datalog’s; our formalization captures the key properties demanded of data structures compatible with Byods, including properties required for incrementalized (semi-naïve) evaluation. We detail many applications of the Byods approach, implementing analyses requiring specialized data structures for transitive and equivalence relations to scale, including an optimized version of the Rust borrow checker Polonius; highly-parallel PageRank made possible by lattices; and a large-scale analysis of LLVM utilizing index-sharing to scale. Our results show that Byods offers both improved algorithmic scalability (reduced time and/or space complexity) and runtimes competitive with state-of-the-art parallelizing Datalog solvers.
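
To make the abstract's point about representation-imposed overhead concrete, the following is a minimal sketch, not the Byods API: the names `EqRel` and `explicit_tuples` are hypothetical and chosen only for illustration. It assumes an equivalence relation over the elements `0..n` and stores it as a union-find forest, so that a class of k elements costs O(k) space rather than the k² tuples an explicit B-tree or hash-set representation would hold. The `insert` method reports whether the relation actually grew, which is the kind of signal a semi-naïve evaluator needs to decide whether a new fact was derived.

```rust
use std::cmp::Ordering;

/// Hypothetical sketch: an equivalence relation over 0..n stored as a
/// union-find forest instead of an explicit set of (a, b) tuples.
pub struct EqRel {
    parent: Vec<usize>,
    rank: Vec<u8>,
}

impl EqRel {
    /// Creates the identity relation: each element is related only to itself.
    pub fn new(n: usize) -> Self {
        EqRel { parent: (0..n).collect(), rank: vec![0; n] }
    }

    /// Returns the representative of the class of `x`, halving paths on the way.
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]];
            x = self.parent[x];
        }
        x
    }

    /// Records the fact eq(a, b). Returns true only if the relation grew,
    /// i.e. the two elements were previously in different classes.
    pub fn insert(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        // Union by rank: attach the shallower tree under the deeper one.
        match self.rank[ra].cmp(&self.rank[rb]) {
            Ordering::Less => self.parent[ra] = rb,
            Ordering::Greater => self.parent[rb] = ra,
            Ordering::Equal => {
                self.parent[rb] = ra;
                self.rank[ra] += 1;
            }
        }
        true
    }

    /// Membership test for eq(a, b); reflexivity, symmetry, and transitivity
    /// hold by construction rather than by materialized tuples.
    pub fn contains(&mut self, a: usize, b: usize) -> bool {
        self.find(a) == self.find(b)
    }
}

/// Number of tuples an explicit representation would store for a single
/// equivalence class of the given size (every ordered pair, including (x, x)).
pub fn explicit_tuples(class_size: usize) -> usize {
    class_size * class_size
}
```

In this sketch, inserting eq(0, 1) and eq(1, 2) makes `contains(2, 0)` hold without ever storing the pair (2, 0), and a later `insert(0, 2)` returns false because it derives nothing new. How Byods itself exposes such structures to its generated code is described in the paper, not here.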



Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Context specific languages
      1. Domain specific languages
2. Theory of computation
  1. Logic
    1. Constraint and logic programming


Published In

Proceedings of the ACM on Programming Languages Volume 7, Issue OOPSLA2

October 2023

2250 pages

EISSN:2475-1421

DOI:10.1145/3554312

Editor:
Michael Hicks
Amazon, USA

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Article Metrics

Total Citations: 5
Total Downloads: 3,637
Downloads (Last 12 months): 3,034
Downloads (Last 6 weeks): 98

Reflects downloads up to 27 Dec 2024

Cited By

Shevchenko R, Doroshenko A, Yatsenko O. (2024). Embedding a family of logic languages with custom monadic unification in Scala. Problems in Programming, 03-11. Online publication date: Jan-2024. https://doi.org/10.15407/pp2024.01.003

Rumbaugh D, Xie D, Zhao Z. (2024). Towards Systematic Index Dynamization. Proceedings of the VLDB Endowment, 17:11, 2867-2879. Online publication date: 1-Jul-2024. https://dl.acm.org/doi/10.14778/3681954.3681969

Bembenek A, Greenberg M, Chong S. (2024). Making Formulog Fast: An Argument for Unconventional Datalog Evaluation. Proceedings of the ACM on Programming Languages, 8:OOPSLA2, 1219-1248. Online publication date: 8-Oct-2024. https://dl.acm.org/doi/10.1145/3689754

Klopp D, Pacak A, Erdweg S, Chiba S, Thüm T. (2024). Separate Compilation and Partial Linking: Modules for Datalog IR. Proceedings of the 23rd ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, 94-106. Online publication date: 21-Oct-2024. https://dl.acm.org/doi/10.1145/3689484.3690737

Abeysinghe S, Xhebraj A, Rompf T. (2024). Flan: An Expressive and Efficient Datalog Compiler for Program Analysis. Proceedings of the ACM on Programming Languages, 8:POPL, 2577-2609. Online publication date: 5-Jan-2024. https://dl.acm.org/doi/10.1145/3632928
