research-article

Compressed linear algebra for large-scale machine learning

Authors: Ahmed Elgohary, Matthias Boehm, Frederick R. Reiss, Berthold Reinwald

Proceedings of the VLDB Endowment, Volume 9, Issue 12

Pages 960 - 971

https://doi.org/10.14778/2994509.2994515

Published: 01 August 2016

Abstract

Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory. General-purpose, heavy- and lightweight compression techniques struggle to achieve both good compression ratios and fast decompression speed to enable block-wise uncompressed operations. Hence, we initiate work on compressed linear algebra (CLA), in which lightweight database compression techniques are applied to matrices and then linear algebra operations such as matrix-vector multiplication are executed directly on the compressed representations. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show that CLA achieves in-memory operations performance close to the uncompressed case and good compression ratios that allow us to fit larger datasets into available memory. We thereby obtain significant end-to-end performance improvements up to 26x or reduced memory requirements.
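
To make the central idea concrete, the following is a minimal sketch of a matrix-vector multiplication executed directly on a compressed, column-wise representation. It is not the paper's implementation: the abstract does not spell out the encoding formats, so the per-column layout used here (each distinct nonzero value mapped to the list of row indices where it occurs) is an assumption chosen for illustration, and the names `compress_column`, `compress_matrix`, and `matvec_compressed` are hypothetical.

```python
from collections import defaultdict


def compress_column(values):
    """Assumed encoding: map each distinct nonzero value to the rows holding it."""
    offsets = defaultdict(list)
    for row, value in enumerate(values):
        if value != 0:
            offsets[value].append(row)
    return dict(offsets)


def compress_matrix(rows):
    """Compress a dense row-major matrix one column at a time."""
    n_cols = len(rows[0]) if rows else 0
    return [compress_column([r[j] for r in rows]) for j in range(n_cols)]


def matvec_compressed(columns, n_rows, v):
    """Compute q = X v on the compressed columns without rebuilding X."""
    q = [0.0] * n_rows
    for j, column in enumerate(columns):
        for value, row_ids in column.items():
            # One multiplication per distinct value, then additions per row.
            contribution = value * v[j]
            for i in row_ids:
                q[i] += contribution
    return q


def matvec_dense(rows, v):
    """Uncompressed reference used only to check the result."""
    return [sum(x * w for x, w in zip(row, v)) for row in rows]


X = [
    [7.0, 0.0, 2.0],
    [7.0, 3.0, 2.0],
    [0.0, 3.0, 2.0],
    [7.0, 0.0, 5.0],
]
v = [1.0, 2.0, 0.5]

cols = compress_matrix(X)
q = matvec_compressed(cols, len(X), v)
```

In this sketch a column with few distinct values stores each value once, and the multiplication performs one multiply per distinct value rather than one per cell. That is the kind of saving that lets operations run on the compressed form; how close it gets to uncompressed performance in practice depends on the data and on details the abstract does not cover.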

Published In

Proceedings of the VLDB Endowment Volume 9, Issue 12

August 2016

345 pages

ISSN:2150-8097

Editors: Surajit Chaudhuri (Microsoft Research), Jayant Haritsa (I.I.Sc. Bangalore)

Publisher

VLDB Endowment
