Never before has it been so easy and inexpensive to gather data in amounts that were beyond imagination only a few years ago. However, as we are all aware, this richness in data goes hand in hand with a poverty of insight, as data understanding cannot keep up with the data deluge. Today, this phenomenon is not confined to highly specialized applications like particle physics at CERN: all areas are affected, from science and engineering through business and administration to society at large, with imminent implications for all of us. There was, and still is, a severe need for theories, methods, tools, and best practices that help us cope with the "volume, velocity, variety, and veracity" of data.
The aim of the International Conference on Scientific and Statistical Database Management (SSDBM) series is to bring database researchers, practitioners, and developers together with scientific domain experts to exchange the most recent research results on database techniques, concepts, tools, and applications for scientific and statistical applications. The 28th SSDBM took place in Budapest, Hungary, from June 18 to 20, 2016, and was organized by the Wigner Research Centre for Physics of the Hungarian Academy of Sciences.
This year's conference had ten sessions: five for presenting research papers, two for keynote talks, one dedicated to demonstrations and poster viewing, a tutorial session, and a panel discussion. The research and poster committee enjoyed the variety and interest of the submissions, which came from quite different angles and sub-disciplines of the broad areas of statistical and scientific data management. Altogether, 63 papers were submitted for review, of which 21 were accepted as full papers, three as posters, and four as demonstrations. The full paper acceptance rate was thus 33%, and the overall acceptance rate was 44%. An innovation this year was the introduction of a tutorial session to advocate tools and technologies that might be of wide interest to the research community. The tutorial on array databases attracted keen interest, given the prevalence of array data in applications from the scientific domain (think of time series or sequences of biological observations) as well as beyond (financial and transport data, etc.). The committee also decided to give students participating in the conference an opportunity to present non-peer-reviewed research posters in a special section, so as to lower the entry barrier into scientific publishing.
The keynote speakers were invited to represent the two main areas of scientific data research: data-intensive system development and scientific applications. Prof. Volker Markl from the Technical University Berlin gave a keynote speech about Apache Flink, an open-source, scalable system for batch and stream processing developed by a team under his supervision, and how this system helps researchers implement deep data analysis workflows through automatic parallelization, optimization, and efficient execution. Prof. István Csabai from the Eötvös Loránd University of Budapest discussed the challenges of massive data analysis in a wide range of scientific fields, from cosmology via genomics to social sciences.
In an interdisciplinary panel titled "Bye Bye Big Data - all problems solved, finally?", a lively discussion took place on the state of affairs in Big Data, how visibly the database field is contributing, and where future avenues for research and technological contributions can be found.
The conference organizers are grateful to all paper authors for the high-quality submissions, and to the research program committee, the demo committee, and the external reviewers for the thorough and timely reviews.
We hope that you, like us, will find the program of this year's conference interesting, inspiring, and relevant to the scientific and statistical data management community.
Peter Baumann - General Chair
Ioana Manolescu - Program Chair
Efficient Feedback Collection for Pay-as-you-go Source Selection
Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Indeed, given ...
Functional Dependencies Unleashed for Scalable Data Exchange
We address the problem of efficiently evaluating target functional dependencies (fds) in the Data Exchange (DE) process. Target fds naturally occur in many DE scenarios, including the ones in Life Sciences in which multiple source relations need to be ...
Graph-based modelling of query sets for differential privacy
Differential privacy has gained attention from the community as the mechanism for privacy protection. Significant effort has focused on its application to data analysis, where statistical queries are submitted in batch and answers to these queries are ...
PAMPAS: Privacy-Aware Mobile Participatory Sensing Using Secure Probes
Mobile participatory sensing could be used in many applications such as vehicular traffic monitoring, pollution tracking, or even health surveying. However, its success depends on finding a solution for querying large numbers of users which protects ...
Geometric Graph Indexing for Similarity Search in Scientific Databases
Searching a database for similar graphs is a critical task in many scientific applications, such as in drug discovery, geoinformatics, or pattern recognition. Typically, graph edit distance is used to estimate the similarity of non-identical graphs, ...
Efficient Similarity Search across Top-k Lists under the Kendall's Tau Distance
We consider the problem of similarity search in a set of top-k lists under the generalized Kendall's Tau distance. This distance describes how related two rankings are in terms of discordantly ordered items. We consider pair- and triplets-based indices ...
Monitoring Spatial Coverage of Trending Topics in Twitter
Most messages posted in Twitter usually discuss an ongoing event, triggering a series of tweets that together may constitute a trending topic (e.g., #election2012, #jesuischarlie, #oscars2016). Sometimes, such a topic may be trending only locally, ...
SPOTHOT: Scalable Detection of Geo-spatial Events in Large Textual Streams
The analysis of social media data poses several challenges: first of all, the data sets are very large, secondly they change constantly, and third they are heterogeneous, consisting of text, images, geographic locations and social connections. In this ...
Efficient Maintenance of All-Pairs Shortest Distances
Computing shortest distances is a central task in many graph applications. Since it is impractical to recompute shortest distances from scratch every time the graph changes, many algorithms have been proposed to incrementally maintain shortest distances ...
Bermuda: An Efficient MapReduce Triangle Listing Algorithm for Web-Scale Graphs
Triangle listing plays an important role in graph analysis and has numerous graph mining applications. With the rapid growth of graph data, distributed methods for listing triangles over massive graphs are urgently needed. Therefore, the triangle ...
PIEJoin: Towards Parallel Set Containment Joins
The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, there still exists a number of open questions: An extensive ...
Multi-Assignment Single Joins for Parallel Cross-Match of Astronomic Catalogs on Heterogeneous Clusters
Cross-match is a central operation in astronomic databases to integrate multiple catalogs of celestial objects. With the rapid development of new astronomy projects, large amounts of astronomic catalogs are generated and require fast cross-match with ...
Regular Path Queries on Massive Graphs
Regular Path Queries (RPQs) represent a powerful tool for querying graph databases and are of particular interest, because they form the building blocks of other query languages, and because they can be used in many theoretical or practical contexts for ...
SolveDB: Integrating Optimization Problem Solvers Into SQL Databases
Many real-world decision problems involve solving optimization problems based on data in an SQL database. Traditionally, solving such problems requires combining a DBMS with optimization software packages for each required class of problems (e.g. linear ...
Compact and queryable representation of raster datasets
Compact data structures combine in a unique data structure a compressed representation of the data and the structures to access such data. The target is to be able to manage data directly in compressed form, and in this way, to keep data always ...
Vectorized UDFs in Column-Stores
Data Scientists rely on vector-based scripting languages such as R, Python and MATLAB to perform ad-hoc data analysis on potentially large data sets. When facing large data sets, they are only efficient when data is processed using vectorized or bulk ...
SPECTRA: Continuous Query Processing for RDF Graph Streams Over Sliding Windows
This paper proposes a new approach for the incremental evaluation of RDF graph streams over sliding windows. Our system, called "SPECTRA", combines a novel form of RDF graph summarisation, a new incremental evaluation method and adaptive indexing ...
Pruning Forests to Find the Trees
The vast majority of phylogenetic databases do not support a declarative querying platform using which their contents can be flexibly and conveniently accessed. The template based query interfaces they support do not allow arbitrary speculative queries. ...
Framework for real-time clustering over sliding windows
Clustering queries over sliding windows require maintaining cluster memberships that change as windows slide. To address this, the Generic 2-phase Continuous Summarization framework (G2CS) utilizes a generation based window maintenance approach where ...
Fast, Explainable View Detection to Characterize Exploration Queries
The aim of data exploration is to get acquainted with an unfamiliar database. Typically, explorers operate by trial and error: they submit a query, study the result, and refine their query subsequently. In this paper, we investigate how to help them ...
Novel Data Reduction Based on Statistical Similarity
Applications such as scientific simulations and power grid monitoring are generating so much data quickly that compression is essential to reduce storage requirement or transmission capacity. To achieve better compression, one is often willing to ...
Data Exchange with MapReduce: A First Cut
Data exchange is one of the oldest database problems, being of both practical and theoretical interest. Given the pace at which heterogeneous data are published on the web, thanks to initiatives such as Linked Data and Open Science, scalability of data ...
Privacy or Security?: Take A Look And Then Decide
Big data paradigm is currently the leading paradigm for data production and management. As a matter of fact, new information are generated at high rates in specialized fields (e.g., cybersecurity scenario). This may cause that the events to be studied ...
SMS: Stable Matching Algorithm using Skylines
In this paper we show how skylines can be used to improve the stable matching algorithm with asymmetric preference sets for men and women. The skyline set of men (or women) in a dataset comprises of those who are not worse off in all the qualities in ...
Array Database Scalability: Intercontinental Queries on Petabyte Datasets
With the deluge of scientific big data affecting a large variety of research institutions, support for large multidimensional arrays has gained traction in the database community in the past decade. Array databases aim to cover the gap left by ...
Demonstrating KDBMS: A Knowledge-based Database Management System
We demonstrate a KDBMS, a prototype system which seamlessly integrates Knowledge base and DBMS. While state-of-the-art approaches, i.e., Ontology-based data access, denoted as OBDA, use ontologies to only query data stored in relational databases using ...
SciServer Compute: Bringing Analysis Close to the Data
SciServer Compute uses Jupyter notebooks running within server-side Docker containers attached to large relational databases and file storage to bring advanced analysis capabilities close to the data. SciServer Compute is a component of SciServer, a big-...
Selective Scan for Filter Operator of SciDB
Recently there has been an increasing interest in analyzing scientific data generated by observations and scientific experiments. For managing these data efficiently, SciDB, a multi-dimensional array-based DBMS, is suggested. When SciDB processes a ...
- Proceedings of the 28th International Conference on Scientific and Statistical Database Management