DOI: 10.1145/1866480.1866510
research-article

How to juggle columns: an entropy-based approach for table compression

Published: 16 August 2010

Abstract

Many relational databases exhibit complex dependencies between data attributes, caused either by the nature of the underlying data or by explicitly denormalized schemas. In data warehouse scenarios, calculated key figures may be materialized or hierarchy levels may be held within a single dimension table. Such column correlations and the resulting data redundancy may result in additional storage requirements. They may also result in bad query performance if inappropriate independence assumptions are made during query compilation. In this paper, we tackle the specific problem of detecting functional dependencies between columns to improve the compression rate for column-based database systems, which both reduces main memory consumption and improves query performance. Although a huge variety of algorithms have been proposed for detecting column dependencies in databases, we maintain that increased data volumes and recent developments in hardware architectures demand novel algorithms with much lower runtime overhead and smaller memory footprint. Our novel approach is based on entropy estimations and exploits a combination of sampling and multiple heuristics to render it applicable for a wide range of use cases. We demonstrate the quality of our approach by means of an implementation within the SAP NetWeaver Business Warehouse Accelerator. Our experiments indicate that our approach scales well with the number of columns and produces reliable dependence structure information. This both reduces memory consumption and improves performance for nontrivial queries.
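
The link between entropy and functional dependencies mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses the plain plug-in (empirical) entropy on all rows, with none of the sampling corrections or heuristics the paper describes. The function names, the tolerance `tol`, and the toy columns are assumptions made for illustration. The idea is that a column A functionally determines a column B exactly when the conditional entropy H(B | A) = H(A, B) - H(A) is zero.

```python
import math
from collections import Counter

def entropy(values):
    """Plug-in (empirical) Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(target, given):
    """H(target | given) = H(given, target) - H(given), from empirical frequencies."""
    return entropy(list(zip(given, target))) - entropy(given)

def determines(given, target, tol=1e-9):
    """True if `given` functionally determines `target` in the rows provided.

    `tol` is a hypothetical tolerance that absorbs floating-point error.
    """
    return conditional_entropy(target, given) <= tol

# Hypothetical toy columns: city determines country, but not the reverse.
city = ["Berlin", "Paris", "Lyon", "Berlin", "Lyon", "Munich"]
country = ["DE", "FR", "FR", "DE", "FR", "DE"]

print(determines(city, country))   # does city -> country hold?
print(determines(country, city))   # does country -> city hold?
```

When such a dependency holds, a column store could in principle keep the dependent column as a mapping from the distinct values of the determining column instead of as a full column, which is the kind of redundancy the abstract targets. On a sample rather than the full table, the plug-in estimate is biased, which is presumably why the paper relies on dedicated entropy estimators.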

Published In

IDEAS '10: Proceedings of the Fourteenth International Database Engineering & Applications Symposium
August 2010
282 pages
ISBN: 9781605589008
DOI: 10.1145/1866480

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

IDEAS '10
Sponsor:
  • ACM
  • Concordia University

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%
