MisDetect: Iterative Mislabel Detection using Early Loss
Supervised machine learning (ML) models trained on data with mislabeled instances often produce inaccurate results due to label errors. Traditional methods of detecting mislabeled instances rely on data proximity, where an instance is considered ...
Capturing More Associations by Referencing External Graphs
This paper studies association rule discovery in a graph G1 by referencing an external graph G2 with overlapping information. The objective is to enrich G1 with relevant properties and links from G2. As a testbed, we consider Graph Association Rules (...
QTCS: Efficient Query-Centered Temporal Community Search
Temporal community search is an important task in graph analysis, which has been widely used in many practical applications. However, existing methods suffer from two major defects: (i) they only require that the target result contains the query vertex q,...
DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release
Machine learning models are known to memorize private data to reduce their training loss, which can be inadvertently exploited by privacy attacks such as model inversion and membership inference. To protect against these attacks, differential privacy (DP)...
How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study
This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput ...
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data and performing sophisticated inference tasks in various application domains. Although GNNs have been shown to be effective on modest-sized graphs, ...
Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective
Many Graph Neural Network (GNN) training systems have emerged recently to support efficient GNN training. Since GNNs embody complex data dependencies between training samples, the training of GNNs should address distinct challenges different from DNN ...
LION: Fast and High-Resolution Network Kernel Density Visualization
Network Kernel Density Visualization (NKDV) has often been used in a wide range of applications, e.g., criminology, transportation science, and urban planning. However, NKDV is computationally expensive and does not scale to large-scale datasets ...
Performance-Based Pricing for Federated Learning via Auction
Many machine learning techniques rely on plenty of training data. However, data are often possessed unequally by different entities, with a large proportion of data being held by a small number of data-rich entities. It can be challenging to incentivize ...
OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams
How to get insights from relational data streams in a timely manner is a hot research topic. Data streams can present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as ...
Influence Maximization via Vertex Countering
Competitive viral marketing considers the product competition of multiple companies, where each user may adopt one product and propagate the product to other users. Existing studies focus on a traditional seeding strategy where a company only selects ...
Optimizing Data Acquisition to Enhance Machine Learning Performance
In this paper, we study how to acquire labeled data points from a large data pool to enrich a training set for enhancing supervised machine learning (ML) performance. The state-of-the-art solution is the clustering-based training set selection (CTS) ...
Minimum Strongly Connected Subgraph Collection in Dynamic Graphs
Real-world directed graphs are dynamically changing, and it is important to identify and maintain the strong connectivity information between nodes, which is useful in numerous applications. Given an input graph G, we study a new problem, minimum ...
FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data
Centralised data management systems (e.g., data lakes) support queries over multi-source heterogeneous data. However, the query results from multiple sources commonly involve between-source conflicts, which makes query results unreliable and confusing ...
POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance
- David Justen
- Daniel Ritter
- Campbell Fraser
- Andrew Lamb
- Allison Lee
- Thomas Bodner
- Mhd Yamen Haddad
- Steffen Zeuch
- Volker Markl
- Matthias Boehm
Join ordering and query optimization are crucial for query performance but remain challenging due to unknown or changing characteristics of query intermediates, especially for complex queries with many joins. Over the past two decades, a spectrum of ...
DAHA: Accelerating GNN Training with Data and Hardware Aware Execution Planning
Graph neural networks (GNNs) have been gaining a reputation for effective modeling of graph data. Yet, it is challenging to train GNNs efficiently. Many frameworks have been proposed but most of them suffer from high batch preparation cost and data ...
FluidKV: Seamlessly Bridging the Gap between Indexing Performance and Memory-Footprint on Ultra-Fast Storage
Our extensive experiments reveal that existing key-value stores (KVSs) achieve high performance at the expense of a huge memory footprint that is often impractical or unacceptable. Even with the emerging ultra-fast byte-addressable persistent memory (PM),...
How Do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses
The tedious grunt work involved in data preparation (prep) before ML reduces ML user productivity. It is also a roadblock to industrial-scale cloud AutoML workflows that build ML models for millions of datasets. One important data prep step for ML is ...
CGgraph: An Ultra-Fast Graph Processing System on Modern Commodity CPU-GPU Co-processor
In recent years, many CPU-GPU heterogeneous graph processing systems have been developed in both academia and industry to facilitate large-scale graph processing in various applications, e.g., social networks and biological networks. However, the ...
FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data
While both the database and high-performance computing (HPC) communities utilize lossless compression methods to minimize floating-point data size, a disconnect persists between them. Each community designs and assesses methods in a domain-specific ...
PairwiseHist: Fast, Accurate and Space-Efficient Approximate Query Processing with Data Compression
Exponential growth in data collection is creating significant challenges for data storage and analytics latency. Approximate Query Processing (AQP) has long been touted as a solution for accelerating analytics on large datasets; however, there is still ...
MetaStore: Analyzing Deep Learning Meta-Data at Scale
The process of training deep learning models produces a huge amount of meta-data, including but not limited to losses, hidden feature embeddings, and gradients. Model diagnosis tools have been developed to analyze losses and feature embeddings with the ...
RTScan: Efficient Scan with Ray Tracing Cores
Indexing is a core technique for accelerating predicate evaluation in databases. After many years of effort, the indexing performance has reached its peak on the existing hardware infrastructure. We propose to use ray tracing (RT) cores to move the ...
FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training
- Kezhao Huang
- Haitian Jiang
- Minjie Wang
- Guangxuan Xiao
- David Wipf
- Xiang Song
- Quan Gan
- Zengfeng Huang
- Jidong Zhai
- Zheng Zhang
A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on ...
Sorting on Byte-Addressable Storage: The Resurgence of Tree Structure
The tree structure is notably popular for storage and indexing; however, tree-based sorting such as tree sort is rarely used in practice. Nevertheless, with the advent of byte-addressable storage (BAS), the tree structure captures our attention with its ...
Efficient Placement of Decomposable Aggregation Functions for Stream Processing over Large Geo-Distributed Topologies
A recent trend in stream processing is offloading the computation of decomposable aggregation functions (DAF) from cloud nodes to geo-distributed fog/edge devices to decrease latency and improve energy efficiency. However, deploying DAFs on low-end ...
AeonG: An Efficient Built-in Temporal Support in Graph Databases
Real-world graphs are often dynamic and evolve over time. Storing and querying a graph's evolution is crucial in graph databases. However, existing works either suffer from high storage overhead or lack efficient temporal query support, or both. ...