research-article

Hardware-Efficient Data Imputation through DBMS Extensibility

Authors:

Hubert Mohr-Daurat,

Georgios Theodorakis,

Holger Pirk

Proceedings of the VLDB Endowment, Volume 17, Issue 11

Pages 3497 - 3510

https://doi.org/10.14778/3681954.3682016

Published: 30 August 2024

Abstract

The separation of data and code/queries has served Data Management Systems (DBMSs) well for decades. However, while the resulting soundness and rigidity are the basis for many performance-oriented optimizations, this separation lacks the flexibility to efficiently support modern data science applications such as data cleansing, data ingestion/augmentation, or generative models. To support such applications without sacrificing performance, we propose a new logical data model called Homoiconic Collection Processing (HCP). HCP is based on a well-known metaprogramming concept called homoiconicity (a unified representation for code and data).

In a DBMS, HCP supports the storage of "classic" relational data but also allows the storage and evaluation of code fragments we refer to as "Homoiconic Expressions". Homoiconic Expressions enable applications such as data imputation directly in the database kernel. Implemented naïvely, such flexibility would come at a prohibitive performance cost. To make HCP performance-competitive with highly tuned in-memory DBMSs, we develop a novel storage and processing model called Shape-Wise Microbatching (SWM) and implement it in a system called BOSS. BOSS is performance-competitive with high-performance DBMSs while offering unprecedented extensibility. To demonstrate this extensibility, we implement an extension for impute-and-query workloads: BOSS outperforms state-of-the-art homoiconic runtimes and data imputation systems by two to five orders of magnitude.
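The idea of storing code fragments next to plain values can be illustrated with a small sketch. This is not the BOSS API or its storage model: the tuple encoding, the `Mean` and `Constant` expression heads, and the `evaluate` and `column_sum` helpers are all hypothetical names chosen for illustration. The sketch only shows a column that mixes values with expressions, where the expressions are evaluated when a query touches them.

```python
from statistics import mean

# Assumption: an expression is encoded as a tuple (head, *args);
# any other cell is a plain value. Both kinds share one column.
def is_expr(cell):
    return isinstance(cell, tuple)

def evaluate(cell, table):
    """Return plain values unchanged; interpret stored expressions."""
    if not is_expr(cell):
        return cell
    head, *args = cell
    if head == "Mean":
        # Impute with the mean of the plain values stored in a column.
        (column,) = args
        known = [v for v in table[column] if not is_expr(v)]
        return mean(known)
    if head == "Constant":
        (value,) = args
        return value
    raise ValueError(f"unknown expression head: {head}")

def column_sum(table, column):
    """A toy query: expressions are evaluated as the column is scanned."""
    return sum(evaluate(cell, table) for cell in table[column])

# Two cells are missing and hold imputation expressions instead of values.
table = {"age": [30, ("Mean", "age"), 50, ("Constant", 40)]}
total = column_sum(table, "age")
```

Interpreting every cell one at a time, as this sketch does, is the naïve evaluation strategy that the abstract describes as prohibitively expensive.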


Index Terms

Hardware-Efficient Data Imputation through DBMS Extensibility
1. Information systems
  1. Data management systems
2. Theory of computation
  1. Theory and algorithms for application domains

Index terms have been assigned to the content through auto-classification.


Published In

Proceedings of the VLDB Endowment Volume 17, Issue 11

July 2024

1039 pages

Editors: Meihui Zhang (Beijing Institute of Technology), Cyrus Shahabi (University of Southern California)

Publisher

VLDB Endowment

Badges

Artifacts Available / v1.1
