research-article

Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines

Published: 01 August 2015

Abstract

We present a new approach for data analytics with iterations. Users express their analysis in Datalog with bag-monotonic aggregate operators, which enables the expression of computations from a broad variety of application domains. Queries are translated into query plans that can execute in shared-nothing engines, are incremental, and support a variety of iterative models (synchronous, asynchronous, different processing priorities) and failure-handling techniques. The plans require only small extensions to an existing shared-nothing engine, making the approach easily implementable. We implement the approach in the Myria big-data management system and use our implementation to empirically study the performance characteristics of different combinations of iterative models, failure handling methods, and applications. Our evaluation uses workloads from a variety of application domains. We find that no single method outperforms others but rather that application properties must drive the selection of the iterative query execution model.
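As a rough illustration of the synchronous and asynchronous iterative models mentioned in the abstract, the sketch below evaluates a recursive shortest-path query with a monotonic min aggregate in two ways. It is a single-process toy under stated assumptions, not the Myria implementation or the paper's query plans: the Datalog-style rule in the comment, the graph, and the names `EDGES`, `adjacency`, `sssp_sync`, and `sssp_async` are all hypothetical and chosen only for this example.

```python
from collections import deque

# Illustrative rule (syntax is an assumption, not taken from the paper):
#   dist(source, 0).
#   dist(y, min(d + w)) :- dist(x, d), edge(x, y, w).

# Hypothetical toy graph: (src, dst, weight) with non-negative weights.
EDGES = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
INF = float("inf")


def adjacency(edges):
    """Group outgoing edges by source node."""
    adj = {}
    for src, dst, weight in edges:
        adj.setdefault(src, []).append((dst, weight))
    return adj


def sssp_sync(edges, source):
    """Synchronous, incremental evaluation: each round reads only the
    facts that changed in the previous round (the delta)."""
    adj = adjacency(edges)
    dist = {source: 0}
    delta = {source: 0}
    while delta:
        new_delta = {}
        for node, d in delta.items():
            for nxt, weight in adj.get(node, []):
                cand = d + weight
                # Keep a candidate only if it improves the min aggregate.
                if cand < dist.get(nxt, INF) and cand < new_delta.get(nxt, INF):
                    new_delta[nxt] = cand
        dist.update(new_delta)  # barrier: apply the whole round at once
        delta = new_delta
    return dist


def sssp_async(edges, source):
    """Asynchronous evaluation: each update is applied as soon as it is
    dequeued, with no barrier between rounds."""
    adj = adjacency(edges)
    dist = {}
    queue = deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        if d >= dist.get(node, INF):
            continue  # stale update: a value at least as good is already known
        dist[node] = d
        for nxt, weight in adj.get(node, []):
            queue.append((nxt, d + weight))
    return dist


if __name__ == "__main__":
    print(sssp_sync(EDGES, "a"))
    print(sssp_async(EDGES, "a"))
```

Because the min aggregate only ever moves values in one direction, both strategies reach the same fixpoint on this graph; they differ in the order in which updates are processed and in how much intermediate work they do, which is the kind of trade-off the abstract says depends on the application.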


Published In

Proceedings of the VLDB Endowment, Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 August 2015
Published in PVLDB Volume 8, Issue 12
