Hardware-Efficient Data Imputation through DBMS Extensibility

Published: 30 August 2024

Abstract

The separation of data and code/queries has served Database Management Systems (DBMSs) well for decades. However, while the resulting soundness and rigidity are the basis for many performance-oriented optimizations, this separation lacks the flexibility to efficiently support modern data science applications such as data cleansing, data ingestion/augmentation, or generative models. To support such applications without sacrificing performance, we propose a new logical data model called Homoiconic Collection Processing (HCP). HCP is based on a well-known metaprogramming concept called homoiconicity (a unified representation for code and data).
In a DBMS, HCP supports the storage of "classic" relational data but also allows the storage and evaluation of code fragments we refer to as "Homoiconic Expressions". Homoiconic Expressions enable applications such as data imputation directly in the database kernel. Implemented naïvely, such flexibility would come at a prohibitive performance cost. To make HCP performance-competitive with highly tuned in-memory DBMSs, we develop a novel storage and processing model called Shape-Wise Microbatching (SWM) and implement it in a system called BOSS. BOSS is performance-competitive with high-performance DBMSs while offering unprecedented extensibility. To demonstrate this extensibility, we implement an extension for impute-and-query workloads: BOSS outperforms state-of-the-art homoiconic runtimes and data imputation systems by two to five orders of magnitude.
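The idea of storing code fragments next to plain values can be illustrated with a small Python sketch. It is not BOSS's API or storage model. The tuple encoding, the expression heads `Mean` and `Constant`, and the helpers `is_expression`, `evaluate` and `resolve` are all hypothetical names chosen for illustration. The sketch only shows a column that holds both data and unevaluated imputation expressions, which are resolved when the column is queried.

```python
from statistics import mean

# Hypothetical encoding (an assumption, not the paper's format):
# plain numbers are data; tuples of the form (head, *args) are
# unevaluated expressions stored in the same column.

def is_expression(cell):
    """A cell is an expression if it is a tuple; anything else is plain data."""
    return isinstance(cell, tuple)

def evaluate(cell, column):
    """Return plain values unchanged and interpret expression cells."""
    if not is_expression(cell):
        return cell
    head, *args = cell
    if head == "Mean":
        # Impute with the mean of the column's plain (non-expression) values.
        known = [c for c in column if not is_expression(c)]
        return mean(known)
    if head == "Constant":
        # Impute with a fixed value carried inside the expression.
        return args[0]
    raise ValueError(f"unknown expression head: {head}")

def resolve(column):
    """Evaluate every cell so that a query sees only plain values."""
    return [evaluate(cell, column) for cell in column]

# Two missing ages are represented as expressions rather than NULLs.
ages = [30, ("Mean",), 50, ("Constant", 18)]
resolved = resolve(ages)
total = sum(resolved)
```

In this toy version every expression is interpreted cell by cell, which is the kind of naïve evaluation the abstract describes as prohibitively expensive at scale.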

Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 11
July 2024
1039 pages

Publisher

VLDB Endowment
