research-article

Automatic index selection for large-scale datalog computation

Authors: Pavle Subotić, Herbert Jordan, Bernhard Scholz

Proceedings of the VLDB Endowment, Volume 12, Issue 2

Pages 141 - 153

https://doi.org/10.14778/3282495.3282500

Published: 01 October 2018

Abstract

Datalog has been applied to several use cases that require very high performance on large rulesets and factsets. It is common to create indexes for relations to improve search performance. However, the existing indexing schemes either require manual index selection or result in insufficient performance on very large tasks. In this paper, we propose an automatic scheme to select indexes. We automatically create the minimum number of indexes to speed up all the searches in a given Datalog program. We have integrated our indexing scheme into an open-source Datalog engine SOUFFLÉ. We obtain performance on a par with what users have accepted from hand-optimized Datalog programs running on state-of-the-art Datalog engines, while we do not require the effort of manual index selection. Extensive experiments on large real Datalog programs demonstrate that our indexing scheme results in considerable speedups (up to 2x) and significantly less memory usage (up to 6x) compared with other automated index selections.
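The abstract does not spell out how the minimum number of indexes is found. As a rough illustration only, the sketch below assumes that each search is described by the set of attributes it binds, that an index is a lexicographic order over attributes, and that an index serves a search when the bound attributes form a prefix of that order. Under these assumptions, searches that form a chain under strict set inclusion can share one index, so the task becomes a minimum chain cover, computed here with a simple augmenting-path bipartite matching. The names `min_index_cover` and `covers` are hypothetical and are not taken from SOUFFLÉ.

```python
def covers(order, search):
    """True if the attributes of `search` form a prefix of the index `order`."""
    search = set(search)
    return set(order[:len(search)]) == search


def min_index_cover(searches):
    """Return attribute orders (tuples), one per index, covering every search.

    Assumption: a search is a non-empty set of bound attribute names.
    """
    # Deduplicate searches and fix a deterministic node order.
    nodes = sorted({frozenset(s) for s in searches if s},
                   key=lambda s: (len(s), sorted(s)))
    n = len(nodes)
    # Edge i -> j whenever search i is a strict subset of search j.
    succ = [[j for j in range(n) if nodes[i] < nodes[j]] for i in range(n)]
    pred_of = [-1] * n  # pred_of[j] = i: search i directly precedes j in a chain

    def try_augment(i, seen):
        # Standard augmenting-path step for bipartite matching.
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if pred_of[j] == -1 or try_augment(pred_of[j], seen):
                pred_of[j] = i
                return True
        return False

    for i in range(n):
        try_augment(i, set())

    next_of = [-1] * n
    for j, i in enumerate(pred_of):
        if i != -1:
            next_of[i] = j

    indexes = []
    for head in range(n):
        if pred_of[head] != -1:
            continue  # not the first search of a chain
        order, seen_attrs, cur = [], set(), head
        while cur != -1:
            # Append the attributes this search adds to the chain so far.
            order.extend(sorted(nodes[cur] - seen_attrs))
            seen_attrs |= nodes[cur]
            cur = next_of[cur]
        indexes.append(tuple(order))
    return indexes


searches = [{"x"}, {"x", "y"}, {"x", "z"}, {"x", "y", "z"}]
indexes = min_index_cover(searches)
```

For the four searches above, the searches on `{x, y}` and `{x, z}` are incomparable, so no single lexicographic order can serve both and two indexes are needed; every search is a prefix of one of them.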

Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 2

October 2018

98 pages

ISSN: 2150-8097

Editors: Lei Chen (HKUST), Fatma Özcan (IBM Research - Almaden)

Publisher

VLDB Endowment
