Declarative Recursive Computation on an RDBMS: or, Why You Should Use a Database For Distributed Machine Learning

Published: 04 September 2020

Abstract

We explore the close relationship between the tensor-based computations performed during modern machine learning and relational database computations. We consider how to make a very small set of changes to a modern RDBMS to make it suitable for distributed learning computations. These changes include better support for recursion, as well as the optimization and execution of very large compute plans. We also show that there are key advantages to using an RDBMS as a machine learning platform. In particular, DBMS-based learning allows for trivial scaling to large data sets and especially to large models, where different computational units operate on different parts of a model that may be too large to fit into RAM.
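To make the correspondence between tensor computations and relational computations concrete, the following is a minimal, hypothetical sketch (not taken from the paper): matrices are stored as (row, col, val) tuples, and a matrix multiply C[i,j] = sum_k A[i,k] * B[k,j] becomes an equi-join on the shared index plus a SUM aggregation. The table names, the sqlite3 backend, and the example values are illustrative assumptions, not the authors' system or schema.

import sqlite3

# Hypothetical sketch: matrices encoded relationally as (row, col, val) tuples.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE A (i INTEGER, k INTEGER, val REAL)")
cur.execute("CREATE TABLE B (k INTEGER, j INTEGER, val REAL)")

# Arbitrary 2x2 example data.
cur.executemany("INSERT INTO A VALUES (?, ?, ?)",
                [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)])
cur.executemany("INSERT INTO B VALUES (?, ?, ?)",
                [(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)])

# C[i, j] = SUM_k A[i, k] * B[k, j]: an equi-join on k, then a grouped aggregate.
rows = cur.execute("""
    SELECT A.i, B.j, SUM(A.val * B.val) AS val
    FROM A JOIN B ON A.k = B.k
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()

for i, j, v in rows:
    print(f"C[{i},{j}] = {v}")  # C = [[19, 22], [43, 50]]

Under this encoding, a model too large to fit in RAM is simply a set of tuples that can be partitioned across machines, and the database's optimizer and execution engine, rather than hand-written distribution code, decides how the join and aggregation are carried out; this is the kind of leverage the abstract argues an RDBMS brings to distributed learning.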

Published In

ACM SIGMOD Record, Volume 49, Issue 1
March 2020, 72 pages
ISSN: 0163-5808
DOI: 10.1145/3422648
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2020
Published in SIGMOD Volume 49, Issue 1

Qualifiers

  • Research-article
