Volume 17, Issue 6, February 2024
Publisher: VLDB Endowment
ISSN: 2150-8097
MisDetect: Iterative Mislabel Detection using Early Loss

Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered ...

Capturing More Associations by Referencing External Graphs

This paper studies association rule discovery in a graph G1 by referencing an external graph G2 with overlapping information. The objective is to enrich G1 with relevant properties and links from G2. As a testbed, we consider Graph Association Rules (...

QTCS: Efficient Query-Centered Temporal Community Search

Temporal community search is an important task in graph analysis, which has been widely used in many practical applications. However, existing methods suffer from two major defects: (i) they only require that the target result contains the query vertex q,...

DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release

Machine learning models are known to memorize private data to reduce their training loss, which can be inadvertently exploited by privacy attacks such as model inversion and membership inference. To protect against these attacks, differential privacy (DP)...
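As a side illustration only (a generic DP-SGD-style step, not the DPSUR selective-update-and-release mechanism proposed in the paper), the sketch below shows how per-example gradient clipping and Gaussian noise are typically combined; the function name, clipping bound, and noise multiplier are assumptions chosen for the example.

import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    # Clip each example's gradient so no single record dominates the update.
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clipping bound masks any individual contribution.
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    noise = np.random.normal(0.0, sigma, size=avg.shape)
    return params - lr * (avg + noise)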

How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study

This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput ...

Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses

Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data and performing sophisticated inference tasks in various application domains. Although GNNs have been shown to be effective on modest-sized graphs, ...

Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective

Many Graph Neural Network (GNN) training systems have emerged recently to support efficient GNN training. Since GNNs embody complex data dependencies between training samples, GNN training must address challenges distinct from DNN ...

LION: Fast and High-Resolution Network Kernel Density Visualization

Network Kernel Density Visualization (NKDV) has been used in a wide range of applications, e.g., criminology, transportation science, and urban planning. However, NKDV is computationally expensive and cannot scale to large datasets ...
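To make the cost concrete, here is a naive 1-D kernel density estimate over plain points, a deliberate simplification rather than the paper's network-constrained NKDV or its LION method: every query location touches every data point, so the work grows as O(n * m) kernel evaluations.

import math

def kde(query_points, data_points, bandwidth=1.0):
    out = []
    for q in query_points:                      # m query locations
        s = 0.0
        for x in data_points:                   # n data points each
            u = (q - x) / bandwidth
            s += math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        out.append(s / (len(data_points) * bandwidth))
    return out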

Performance-Based Pricing for Federated Learning via Auction

Many machine learning techniques rely on abundant training data. However, data are often distributed unequally across entities, with a large proportion held by a small number of data-rich entities. It can be challenging to incentivize ...

OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams

Extracting insights from relational data streams in a timely manner is an active research topic. Data streams can present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as ...

Influence Maximization via Vertex Countering

Competitive viral marketing considers the product competition of multiple companies, where each user may adopt one product and propagate the product to other users. Existing studies focus on a traditional seeding strategy where a company only selects ...

Optimizing Data Acquisition to Enhance Machine Learning Performance

In this paper, we study how to acquire labeled data points from a large data pool to enrich a training set for enhancing supervised machine learning (ML) performance. The state-of-the-art solution is the clustering-based training set selection (CTS) ...

Minimum Strongly Connected Subgraph Collection in Dynamic Graphs

Real-world directed graphs are dynamically changing, and it is important to identify and maintain the strong connectivity information between nodes, which is useful in numerous applications. Given an input graph G, we study a new problem, minimum ...

FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data

Centralised data management systems (e.g., data lakes) support queries over multi-source heterogeneous data. However, the query results from multiple sources commonly involve between-source conflicts, which makes query results unreliable and confusing ...

POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance

Join ordering and query optimization are crucial for query performance but remain challenging due to unknown or changing characteristics of query intermediates, especially for complex queries with many joins. Over the past two decades, a spectrum of ...

DAHA: Accelerating GNN Training with Data and Hardware Aware Execution Planning

Graph neural networks (GNNs) have been gaining a reputation for effective modeling of graph data. Yet, it is challenging to train GNNs efficiently. Many frameworks have been proposed but most of them suffer from high batch preparation cost and data ...

FluidKV: Seamlessly Bridging the Gap between Indexing Performance and Memory-Footprint on Ultra-Fast Storage

Our extensive experiments reveal that existing key-value stores (KVSs) achieve high performance at the expense of a huge memory footprint that is often impractical or unacceptable. Even with the emerging ultra-fast byte-addressable persistent memory (PM),...

How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses

The tedious grunt work involved in data preparation (prep) before ML reduces ML user productivity. It is also a roadblock to industrial-scale cloud AutoML workflows that build ML models for millions of datasets. One important data prep step for ML is ...

CGgraph: An Ultra-Fast Graph Processing System on Modern Commodity CPU-GPU Co-processor

In recent years, many CPU-GPU heterogeneous graph processing systems have been developed in both academia and industry to facilitate large-scale graph processing in various applications, e.g., social networks and biological networks. However, the ...

FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data

While both the database and high-performance computing (HPC) communities utilize lossless compression methods to minimize floating-point data size, a disconnect persists between them. Each community designs and assesses methods in a domain-specific ...

PairwiseHist: Fast, Accurate and Space-Efficient Approximate Query Processing with Data Compression

Exponential growth in data collection is creating significant challenges for data storage and analytics latency. Approximate Query Processing (AQP) has long been touted as a solution for accelerating analytics on large datasets; however, there is still ...

MetaStore: Analyzing Deep Learning Meta-Data at Scale

The process of training deep learning models produces a huge amount of meta-data, including but not limited to losses, hidden feature embeddings, and gradients. Model diagnosis tools have been developed to analyze losses and feature embeddings with the ...

RTScan: Efficient Scan with Ray Tracing Cores

Indexing is a core technique for accelerating predicate evaluation in databases. After many years of effort, indexing performance has reached its peak on the existing hardware infrastructure. We propose to use ray tracing (RT) cores to move the ...

FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training

A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on ...

Sorting on Byte-Addressable Storage: The Resurgence of Tree Structure

The tree structure is notably popular for storage and indexing; however, tree-based sorting such as tree sort is rarely used in practice. Nevertheless, with the advent of byte-addressable storage (BAS), the tree structure captures our attention with its ...
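For context, a plain in-memory tree sort (build a binary search tree, then read it back in order) looks like the sketch below; this is only the textbook technique the abstract mentions, not the paper's byte-addressable-storage design, and the helper names are illustrative.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # Standard BST insertion; duplicates go to the right subtree.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def tree_sort(values):
    root = None
    for v in values:
        root = insert(root, v)
    out = []
    def inorder(n):
        # In-order traversal emits keys in ascending order.
        if n:
            inorder(n.left); out.append(n.key); inorder(n.right)
    inorder(root)
    return out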

Efficient Placement of Decomposable Aggregation Functions for Stream Processing over Large Geo-Distributed Topologies

A recent trend in stream processing is offloading the computation of decomposable aggregation functions (DAF) from cloud nodes to geo-distributed fog/edge devices to decrease latency and improve energy efficiency. However, deploying DAFs on low-end ...
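A decomposable aggregation function can be computed from mergeable partial states, as in the toy average below; the device batches, helper names, and pairwise merge order are illustrative assumptions, and the paper's placement problem itself is not modeled here.

def partial(values):
    # Local aggregate computed on one edge device: (sum, count).
    return (sum(values), len(values))

def merge(p, q):
    # Associative, commutative combine; can run anywhere in the topology.
    return (p[0] + q[0], p[1] + q[1])

def finalize(p):
    # Evaluated once at the sink to produce the average.
    return p[0] / p[1] if p[1] else float("nan")

# Example: three edge devices, merged pairwise on the way to the cloud.
device_batches = [[3, 5], [8], [2, 2, 4]]
partials = [partial(b) for b in device_batches]
result = finalize(merge(merge(partials[0], partials[1]), partials[2]))  # -> 4.0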

AeonG: An Efficient Built-in Temporal Support in Graph Databases

Real-world graphs are often dynamic and evolve over time. Storing and querying a graph's evolution in graph databases is therefore crucial. However, existing works either suffer from high storage overhead or lack efficient temporal query support, or both. ...
