research-article

Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines

Editors: Chen Li, Volker Markl

Authors: Magdalena Balazinska, Daniel Halperin

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Pages 1542 - 1553

https://doi.org/10.14778/2824032.2824052

Published: 01 August 2015

Abstract

We present a new approach for data analytics with iterations. Users express their analysis in Datalog with bag-monotonic aggregate operators, which enables the expression of computations from a broad variety of application domains. Queries are translated into query plans that can execute in shared-nothing engines, are incremental, and support a variety of iterative models (synchronous, asynchronous, different processing priorities) and failure-handling techniques. The plans require only small extensions to an existing shared-nothing engine, making the approach easily implementable. We implement the approach in the Myria big-data management system and use our implementation to empirically study the performance characteristics of different combinations of iterative models, failure handling methods, and applications. Our evaluation uses workloads from a variety of application domains. We find that no single method outperforms others but rather that application properties must drive the selection of the iterative query execution model.
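To make the distinction between synchronous and asynchronous iteration concrete, the following is a minimal Python sketch. It is not the Myria implementation and not the query plans from the paper. It assumes a hypothetical single-source shortest-path query with a min aggregate, written informally as `dist(src, 0).` and `dist(y, min(d + w)) :- dist(x, d), edge(x, y, w).`; the graph, the function names, and the choice of "smallest tentative distance first" as the processing priority are all illustrative assumptions.

```python
import heapq
from collections import defaultdict

# Hypothetical toy input: edge(x, y, w) facts (not data from the paper).
EDGES = [("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)]
INF = float("inf")


def build_index(edges):
    """Index edges by source vertex so the recursive rule can be joined."""
    out = defaultdict(list)
    for x, y, w in edges:
        out[x].append((y, w))
    return out


def sssp_sync(edges, source):
    """Synchronous evaluation: proceed in rounds separated by a barrier.
    Each round joins only the previous round's delta with edge."""
    out = build_index(edges)
    dist = {source: 0}
    delta = {source: 0}
    rounds = 0
    while delta:
        rounds += 1
        new_delta = {}
        for x, d in delta.items():
            for y, w in out[x]:
                cand = d + w
                # Keep a candidate only if it improves the current minimum.
                if cand < dist.get(y, INF) and cand < new_delta.get(y, INF):
                    new_delta[y] = cand
        dist.update(new_delta)
        delta = new_delta
    return dist, rounds


def sssp_async(edges, source):
    """Asynchronous evaluation: apply updates one at a time with no round
    barrier. Assumed priority: smallest tentative distance first."""
    out = build_index(edges)
    dist = {}
    queue = [(0, source)]
    applied = 0
    while queue:
        d, x = heapq.heappop(queue)
        if d >= dist.get(x, INF):
            continue  # stale update: a smaller value was already applied
        dist[x] = d
        applied += 1
        for y, w in out[x]:
            if d + w < dist.get(y, INF):
                heapq.heappush(queue, (d + w, y))
    return dist, applied


if __name__ == "__main__":
    print(sssp_sync(EDGES, "a"))
    print(sssp_async(EDGES, "a"))
```

Because min only ever lowers a stored value, both strategies reach the same fixpoint on this input; they differ in how much intermediate work they do and in what order, which is the kind of trade-off the abstract says depends on the application.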

Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Information

Published In

Proceedings of the VLDB Endowment Volume 8, Issue 12

Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

August 2015

728 pages

ISSN: 2150-8097

Editors: Chen Li (University of California, Irvine), Volker Markl (TU Berlin)

Publisher

VLDB Endowment

Bibliometrics & Citations

Article Metrics

Total Citations: 31
Total Downloads: 227
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 1

Reflects downloads up to 27 Dec 2024.

Citations

Cited By

Fejza A, Genevès P, Layaïda N (2024). Efficient Enumeration of Recursive Plans in Transformation-Based Query Optimizers. Proceedings of the VLDB Endowment 17:11 (3095-3108). DOI: 10.14778/3681954.3681986. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.14778/3681954.3681986
Shaikhha A, Suciu D, Schleich M, Ngo H (2024). Optimizing Nested Recursive Queries. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639271. Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639271
Taranov K, Byan S, Marathe V, Hoefler T, Ives Z, Bonifati A, El Abbadi A (2022). KafkaDirect: Zero-copy Data Access for Apache Kafka over RDMA Networks. Proceedings of the 2022 International Conference on Management of Data (2191-2204). DOI: 10.1145/3514221.3526056. Online publication date: 10-Jun-2022.
https://dl.acm.org/doi/10.1145/3514221.3526056
Wu J, Wang J, Zaniolo C, Ives Z, Bonifati A, El Abbadi A (2022). Optimizing Parallel Recursive Datalog Evaluation on Multicore Machines. Proceedings of the 2022 International Conference on Management of Data (1433-1446). DOI: 10.1145/3514221.3517853. Online publication date: 10-Jun-2022.
https://dl.acm.org/doi/10.1145/3514221.3517853
Wang Y, Abo Khamis M, Ngo H, Pichler R, Suciu D, Ives Z, Bonifati A, El Abbadi A (2022). Optimizing Recursive Queries with Program Synthesis. Proceedings of the 2022 International Conference on Management of Data (79-93). DOI: 10.1145/3514221.3517827. Online publication date: 10-Jun-2022.
https://dl.acm.org/doi/10.1145/3514221.3517827
Imran M, Gévay G, Quiané-Ruiz J, Markl V (2022). Fast datalog evaluation for batch and stream graph processing. World Wide Web 25:2 (971-1003). DOI: 10.1007/s11280-021-00960-w. Online publication date: 20-Jan-2022.
https://dl.acm.org/doi/10.1007/s11280-021-00960-w
Kaminski M, Kostylev E, Grau B, Motik B, Horrocks I (2021). The Complexity and Expressive Power of Limit Datalog. Journal of the ACM 69:1 (1-83). DOI: 10.1145/3495009. Online publication date: 22-Dec-2021.
https://dl.acm.org/doi/10.1145/3495009
Gévay G, Soto J, Markl V (2021). Handling Iterations in Distributed Dataflow Systems. ACM Computing Surveys 54:9 (1-38). DOI: 10.1145/3477602. Online publication date: 8-Oct-2021.
https://dl.acm.org/doi/10.1145/3477602
Zuo Z, Wang K, Hussain A, Sani A, Zhang Y, Lu S, Dou W, Wang L, Li X, Wang C, Xu G (2021). Systemizing Interprocedural Static Analysis of Large-scale Systems Code with Graspan. ACM Transactions on Computer Systems 38:1-2 (1-39). DOI: 10.1145/3466820. Online publication date: 29-Jul-2021.
https://dl.acm.org/doi/10.1145/3466820
Wang J, Wu J, Li M, Gu J, Das A, Zaniolo C (2021). Formal semantics and high performance in declarative machine learning using Datalog. The VLDB Journal 30:5 (859-881). DOI: 10.1007/s00778-021-00665-6. Online publication date: 31-May-2021.
https://dl.acm.org/doi/10.1007/s00778-021-00665-6
