DOI: 10.1145/2949689.2949692
research-article

Fast, Explainable View Detection to Characterize Exploration Queries

Published: 18 July 2016

Abstract

The aim of data exploration is to get acquainted with an unfamiliar database. Typically, explorers operate by trial and error: they submit a query, study the result, and then refine their query. In this paper, we investigate how to help them understand their query results. In particular, we focus on medium- to high-dimensional spaces: if the database contains dozens or hundreds of columns, which variables should they inspect? We propose to detect subspaces in which the users' selection differs from the rest of the database. From this idea, we built Ziggy, a tuple description engine. Ziggy detects informative subspaces and explains why it recommends them, using visualizations and natural language. It copes with mixed data and missing values, and it penalizes redundancy. Our experiments reveal that it is up to an order of magnitude faster than state-of-the-art feature selection algorithms, at minimal accuracy costs.
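
To make the core idea concrete, the sketch below ranks columns by how much the selected tuples differ from the remaining ones. It is an illustration under strong assumptions, not Ziggy's actual algorithm: it scores each numeric column separately with a standardized mean difference (Cohen's d), whereas the abstract describes multi-column subspaces, mixed data, and a redundancy penalty, none of which are modeled here. The names `cohens_d`, `rank_columns`, `rows`, and `selected` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(inside, outside):
    """Standardized mean difference between two samples, using the pooled variance."""
    n1, n2 = len(inside), len(outside)
    pooled_var = ((n1 - 1) * variance(inside) + (n2 - 1) * variance(outside)) / (n1 + n2 - 2)
    if pooled_var == 0:
        # No spread in either group: the ratio is undefined, so this sketch reports 0.0.
        return 0.0
    return (mean(inside) - mean(outside)) / sqrt(pooled_var)


def rank_columns(rows, selected, columns):
    """Rank numeric columns by |d| between the selected rows and the rest.

    Missing values (None) are simply dropped per column, which is an
    assumption of this sketch rather than the paper's treatment.
    """
    scores = {}
    for col in columns:
        inside = [r[col] for r, s in zip(rows, selected) if s and r[col] is not None]
        outside = [r[col] for r, s in zip(rows, selected) if not s and r[col] is not None]
        if len(inside) < 2 or len(outside) < 2:
            continue  # too few values to estimate a variance
        scores[col] = cohens_d(inside, outside)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy table: the selection differs from the rest on "price" but not on "weight".
rows = [
    {"price": 10.0, "weight": 1.0},
    {"price": 12.0, "weight": 2.0},
    {"price": 11.0, "weight": 3.0},
    {"price": 30.0, "weight": 1.0},
    {"price": 32.0, "weight": 2.0},
    {"price": 31.0, "weight": 3.0},
]
selected = [False, False, False, True, True, True]
ranking = rank_columns(rows, selected, ["price", "weight"])
```

In this toy table, `price` ranks first because the selected rows have a much higher mean than the others, while `weight` has the same distribution in both groups and scores zero. A per-column score like this cannot detect differences that only appear when several columns are considered jointly, which is one reason the abstract speaks of subspaces.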


    Published In

    SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
    July 2016
    290 pages

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Data exploration
    2. data description
    3. subspace search

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SSDBM '16

    Acceptance Rates

    Overall Acceptance Rate 56 of 146 submissions, 38%
