research-article
Open access

Bring Your Own Data Structures to Datalog

Published: 16 October 2023

Abstract

The restricted logic programming language Datalog has become a popular implementation target for deductive-analytic workloads including social-media analytics and program analysis. Modern Datalog engines compile Datalog rules to joins over explicit representations of relations—often B-trees or hash maps. While these modern engines have enabled high scalability in many application domains, they have a crucial weakness: achieving the desired algorithmic complexity may be impossible due to representation-imposed overhead of the engine’s data structures. In this paper, we present the "Bring Your Own Data Structures" (Byods) approach, in the form of a DSL embedded in Rust. Using Byods, an engineer writes logical rules which are implicitly parametric on the concrete data structure representation; our implementation provides an interface to enable "bringing their own" data structures to represent relations, which harmoniously interact with code generated by our compiler (implemented as Rust procedural macros). We formalize the semantics of Byods as an extension of Datalog’s; our formalization captures the key properties demanded of data structures compatible with Byods, including properties required for incrementalized (semi-naïve) evaluation. We detail many applications of the Byods approach, implementing analyses requiring specialized data structures for transitive and equivalence relations to scale, including an optimized version of the Rust borrow checker Polonius; highly-parallel PageRank made possible by lattices; and a large-scale analysis of LLVM utilizing index-sharing to scale. Our results show that Byods offers both improved algorithmic scalability (reduced time and/or space complexity) and runtimes competitive with state-of-the-art parallelizing Datalog solvers.
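To make the motivation concrete, here is a minimal Rust sketch of the kind of specialized structure the abstract alludes to for equivalence relations: a union-find forest that stores a relation over n elements in O(n) space, where an explicit tuple set could need O(n²). The type `EqRel` and its methods `insert` and `contains` are hypothetical names chosen for illustration; they are not the Byods interface, which this page does not show.

```rust
use std::cmp::Ordering;

/// Hypothetical illustration (not the Byods API): an equivalence relation
/// over the elements 0..n, stored as a union-find forest rather than as an
/// explicit set of (a, b) tuples.
struct EqRel {
    parent: Vec<usize>,
    rank: Vec<u8>,
}

impl EqRel {
    /// Start with every element in its own equivalence class.
    fn new(n: usize) -> Self {
        EqRel { parent: (0..n).collect(), rank: vec![0; n] }
    }

    /// Find the class representative of `x`, shortening paths as we go
    /// (path halving).
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Add the fact `a ~ b`. Returns true only if the relation changed,
    /// which is the signal an incremental evaluator would need in order to
    /// decide whether there is anything new to propagate.
    fn insert(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false; // already implied by symmetry/transitivity
        }
        // Union by rank: attach the shallower tree under the deeper one.
        match self.rank[ra].cmp(&self.rank[rb]) {
            Ordering::Less => self.parent[ra] = rb,
            Ordering::Greater => self.parent[rb] = ra,
            Ordering::Equal => {
                self.parent[rb] = ra;
                self.rank[ra] += 1;
            }
        }
        true
    }

    /// Query whether `a ~ b` holds, without materializing any tuples.
    fn contains(&mut self, a: usize, b: usize) -> bool {
        self.find(a) == self.find(b)
    }
}
```

With this sketch, inserting `(0, 1)` and then `(1, 2)` makes `contains(2, 0)` true without ever storing the tuple `(2, 0)`, and a later `insert(0, 2)` returns false because it adds no new information.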


Published In

Proceedings of the ACM on Programming Languages  Volume 7, Issue OOPSLA2
October 2023
2250 pages
EISSN:2475-1421
DOI:10.1145/3554312
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 October 2023
Published in PACMPL Volume 7, Issue OOPSLA2


Author Tags

  1. Datalog
  2. Logic Programming
  3. Program Analysis
  4. Static Analysis

Qualifiers

  • Research-article
