Fast, Explainable View Detection to Characterize Exploration Queries

Authors:

Thibault Sellam,

Martin Kersten

SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management

Article No.: 20, Pages 1 - 12

https://doi.org/10.1145/2949689.2949692

Published: 18 July 2016

Abstract

The aim of data exploration is to get acquainted with an unfamiliar database. Typically, explorers operate by trial and error: they submit a query, study the result, and then refine the query. In this paper, we investigate how to help them understand their query results. In particular, we focus on medium- to high-dimensional spaces: if the database contains dozens or hundreds of columns, which variables should they inspect? We propose to detect subspaces in which the user's selection differs from the rest of the database. From this idea, we built Ziggy, a tuple description engine. Ziggy can detect informative subspaces, and it can explain why it recommends them, with visualizations and natural language. It copes with mixed data and missing values, and it penalizes redundancy. Our experiments reveal that it is up to an order of magnitude faster than state-of-the-art feature selection algorithms, at minimal accuracy costs.
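The core idea, comparing the user's selection with the rest of the database column by column, can be illustrated with a minimal sketch. This is not Ziggy's implementation: the abstract does not state which dissimilarity measure the engine uses, so the sketch assumes a standardized mean difference on numeric columns, and the function names `column_scores` and `rank_columns` are hypothetical. It ignores categorical columns, multi-column subspaces, and redundancy.

```python
from statistics import mean, pstdev


def column_scores(rows, selected):
    """Score each numeric column by how far the selection's mean lies
    from the mean of the remaining rows, in pooled standard deviations.

    rows: list of dicts mapping column name -> number (None = missing value)
    selected: list of booleans, True where the row is in the user's selection

    Assumption: a standardized mean difference stands in for whatever
    measure the actual system uses.
    """
    scores = {}
    for col in rows[0]:
        inside = [r[col] for r, s in zip(rows, selected)
                  if s and r[col] is not None]
        outside = [r[col] for r, s in zip(rows, selected)
                   if not s and r[col] is not None]
        if len(inside) < 2 or len(outside) < 2:
            continue  # too few non-missing values to compare
        # Pooled standard deviation of the two groups (equal weighting).
        pooled = ((pstdev(inside) ** 2 + pstdev(outside) ** 2) / 2) ** 0.5
        if pooled == 0:
            continue  # both groups are constant; the ratio is undefined
        scores[col] = abs(mean(inside) - mean(outside)) / pooled
    return scores


def rank_columns(rows, selected):
    """Return the scored column names, most distinctive first."""
    scores = column_scores(rows, selected)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    rows = [
        {"age": 30, "income": 10},
        {"age": 40, "income": 12},
        {"age": 31, "income": 50},
        {"age": 41, "income": 52},
    ]
    selected = [False, False, True, True]
    print(rank_columns(rows, selected))  # ['income', 'age']
```

In this toy table the selected rows differ sharply in `income` but barely in `age`, so `income` is ranked first as the column worth inspecting.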

Published In

SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management

July 2016

290 pages

ISBN: 9781450342155

DOI: 10.1145/2949689

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SSDBM '16

SSDBM '16: Conference on Scientific and Statistical Database Management

July 18 - 20, 2016

Budapest, Hungary

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
