research-article

Sampling cube: a framework for statistical OLAP over sampling data

Authors:

Yizhou Sun

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Pages 779 - 790

https://doi.org/10.1145/1376616.1376695

Published: 09 June 2008

Abstract

Sampling is a popular method of data collection when it is impossible or too costly to reach the entire population. For example, television show ratings in the United States are gathered from a sample of roughly 5,000 households. To use the results effectively, the samples are further partitioned in a multidimensional space based on multiple attribute values. This naturally makes OLAP (Online Analytical Processing) over sampling data desirable. However, unlike traditional data, sampling data is inherently uncertain, i.e., it does not represent the full data in the population. Thus, it is desirable to return not only query results but also confidence intervals indicating the reliability of the results. Moreover, a certain segment in a multidimensional space may contain no samples or too few. This requires additional analysis to return trustworthy results.
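As a rough illustration of returning a confidence interval alongside a query result, the sketch below computes an interval for the mean of the samples in one cell. It assumes a normal approximation via the standard library's `NormalDist`; the abstract does not specify the estimator, and a Student's t critical value would give a wider interval for small samples. The function name and the ratings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(samples, confidence=0.95):
    """Return (mean, lower, upper) for the sample mean.

    Assumption: a normal approximation is used for the critical value.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate spread")
    center = mean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_error = stdev(samples) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * std_error
    return center, center - half_width, center + half_width


# Hypothetical ratings sampled from a single cell.
ratings = [3.0, 4.0, 5.0, 4.0, 4.0]
center, lower, upper = mean_confidence_interval(ratings)
```

The interval narrows as the number of samples in the cell grows, which is why a cell with very few samples yields an unreliable answer.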

In this paper, we propose a Sampling Cube framework, which efficiently calculates confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase the sample size when needed. Further, to handle high-dimensional data, a Sampling Cube Shell method is proposed to effectively reduce the storage requirement while still preserving query result quality.
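One simplified way to picture using the OLAP structure to enlarge a sparse cell is a roll-up: when a cell holds too few samples, a dimension is generalized so that samples from sibling cells are pooled. The sketch below is an assumption-laden illustration, not the paper's method: it generalizes dimensions in a fixed order and omits any test of whether the pooled segments are actually similar. All names (`matches`, `expand_query`, the row fields) are hypothetical.

```python
def matches(row, cell):
    """A row matches a cell when every non-wildcard dimension agrees."""
    return all(value is None or row[dim] == value for dim, value in cell.items())


def expand_query(rows, cell, measure, min_samples):
    """Return (cell_used, values), rolling up until enough samples are found.

    Assumption: dimensions are generalized (set to None) one at a time
    in the order they appear in `cell`; no similarity check is done.
    """
    current = dict(cell)
    values = [r[measure] for r in rows if matches(r, current)]
    for dim in cell:
        if len(values) >= min_samples:
            break
        current[dim] = None  # roll up: stop constraining this dimension
        values = [r[measure] for r in rows if matches(r, current)]
    return current, values


# Hypothetical sample table with two dimensions and one measure.
rows = [
    {"age": "young", "region": "east", "rating": 4.0},
    {"age": "young", "region": "west", "rating": 5.0},
    {"age": "young", "region": "west", "rating": 3.0},
    {"age": "old", "region": "east", "rating": 2.0},
]
used, values = expand_query(
    rows, {"region": "east", "age": "young"}, "rating", min_samples=3
)
```

Here the original cell (young, east) has a single sample, so the sketch drops the region constraint and answers from all three "young" samples instead.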

Index Terms

1. Information systems
  1. Information systems applications

Published In

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

June 2008

1396 pages

ISBN:9781605581026

DOI:10.1145/1376616

General Chairs: Laks V. S. Lakshmanan (University of British Columbia, Canada), Raymond T. Ng (University of British Columbia, Canada), Dennis Shasha (New York University, USA)

Copyright © 2008 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

olap sampling

Conference

SIGMOD/PODS '08: International Conference on Management of Data

June 9 - 12, 2008

Vancouver, Canada

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
