- research-article, July 2024
InSS: An Intelligent Scheduling Orchestrator for Multi-GPU Inference With Spatio-Temporal Sharing
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 10, Oct. 2024, Pages 1735–1748, https://doi.org/10.1109/TPDS.2024.3430063
As the applications of AI proliferate, it is critical to increase the throughput of online DNN inference services. Multi-process service (MPS) improves the utilization rate of GPU resources by spatial-sharing, but it also brings unique challenges. First, ...
- research-article, February 2024
End-to-End Bayesian Networks Exact Learning in Shared Memory
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 4, April 2024, Pages 634–645, https://doi.org/10.1109/TPDS.2024.3366471
Bayesian networks are important Machine Learning models with many practical applications in, e.g., biomedicine and bioinformatics. The problem of Bayesian networks learning is $\mathcal{NP}$...
- research-article, January 2024
Multi-Agent Deep Reinforcement Learning Framework for Renewable Energy-Aware Workflow Scheduling on Distributed Cloud Data Centers
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 4, April 2024, Pages 604–615, https://doi.org/10.1109/TPDS.2024.3360448
The ever-increasing demand for the cloud computing paradigm has resulted in the widespread deployment of multiple datacenters, the operations of which consume very high levels of energy. The carbon footprint resulting from these operations threatens ...
- research-article, December 2023
A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 4, April 2024, Pages 577–591, https://doi.org/10.1109/TPDS.2023.3343570
With the increasing volumes of data samples and deep neural network (DNN) models, efficiently scaling the training of DNN models has become a significant challenge for server clusters with AI accelerators in terms of memory and computing efficiency. ...
- research-article, December 2023
Graft: Efficient Inference Serving for Hybrid Deep Learning With SLO Guarantees via DNN Re-Alignment
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 2, Feb. 2024, Pages 280–296, https://doi.org/10.1109/TPDS.2023.3340518
Deep neural networks (DNNs) have been widely adopted for various mobile inference tasks, yet their ever-increasing computational demands are hindering their deployment on resource-constrained mobile devices. Hybrid deep learning partitions a DNN into two ...
- research-article, November 2023
Batch Jobs Load Balancing Scheduling in Cloud Computing Using Distributional Reinforcement Learning
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 1, Jan. 2024, Pages 169–185, https://doi.org/10.1109/TPDS.2023.3334519
In cloud computing, how to reasonably allocate computing resources for batch jobs to ensure the load balance of dynamic clusters and meet user requests is an important and challenging task. Most existing studies are based on deep Q network, which utilizes ...
- research-article, November 2023
US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 1, Jan. 2024, Pages 123–139, https://doi.org/10.1109/TPDS.2023.3331372
The communication bottleneck severely constrains the scalability of distributed deep learning, and efficient communication scheduling accelerates distributed DNN training by overlapping computation and communication tasks. However, existing approaches ...
- research-article, November 2023
SpatialSSJP: QoS-Aware Adaptive Approximate Stream-Static Spatial Join Processor
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 1, Jan. 2024, Pages 73–88, https://doi.org/10.1109/TPDS.2023.3330669
The widespread adoption of Internet of Things (IoT) motivated the emergence of mixed workloads in smart cities, where fast arriving geo-referenced big data streams are joined with archive tables, aiming at enriching streams with descriptive attributes ...
- research-article, October 2023
FedHAP: Federated Hashing With Global Prototypes for Cross-Silo Retrieval
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 4, April 2024, Pages 592–603, https://doi.org/10.1109/TPDS.2023.3324426
Deep hashing has been widely applied in large-scale data retrieval due to its superior retrieval efficiency and low storage cost. However, data are often scattered in data silos with privacy concerns, so performing centralized data storage and retrieval ...
- research-article, September 2023
Consistent Low Latency Scheduler for Distributed Key-Value Stores
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 34, Issue 12, Dec. 2023, Pages 3012–3027, https://doi.org/10.1109/TPDS.2023.3315777
Nowadays, the distributed key-value stores have become the basic building block for large-scale cloud applications. In large-scale distributed key-value stores, many key-value access operations, which will be processed in parallel on different servers, ...
- research-article, September 2023
RLPTO: A Reinforcement Learning-Based Performance-Time Optimized Task and Resource Scheduling Mechanism for Distributed Machine Learning
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 34, Issue 12, Dec. 2023, Pages 3266–3279, https://doi.org/10.1109/TPDS.2023.3317388
With the wide application of deep learning, the amount of data required to train deep learning models is becoming increasingly larger, resulting in an increased training time and higher requirements for computing resources. To improve the throughput of a ...
- research-article, September 2023
Task Placement and Resource Allocation for Edge Machine Learning: A GNN-Based Multi-Agent Reinforcement Learning Paradigm
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 34, Issue 12, Dec. 2023, Pages 3073–3089, https://doi.org/10.1109/TPDS.2023.3313779
Machine learning (ML) tasks are one of the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate the requested amounts of resources to each task, falling short of best utilizing the limited edge resources for ...
- research-article, December 2022
Tenant-Grained Request Scheduling in Software-Defined Cloud Computing
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4654–4671, https://doi.org/10.1109/TPDS.2022.3199031
Cloud providers host various services for tenants’ requests (e.g., software-as-a-service) and seek to serve as many requests as possible for revenue maximization. Considering a large number of requests, the previous works on fine-grained request ...
- research-article, December 2022
A Bi-Objective Learn-and-Deploy Scheduling Method for Bursty and Stochastic Requests on Heterogeneous Cloud Servers
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4547–4562, https://doi.org/10.1109/TPDS.2022.3196475
In this article, we consider the dynamic allocation of bursty requests stochastically arriving at heterogeneous servers with uncertain setup times. Lower expected response time and less power consumption are desirable objectives of users and service ...
- research-article, December 2022
Theoretical Analysis of an Adaptive Periodic Multi Installment Scheduling With Result Retrieval for SAR Image Processing
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4672–4683, https://doi.org/10.1109/TPDS.2022.3194542
Processing a large-scale Synthetic Aperture Radar (SAR) image dataset on a distributed computing infrastructure poses a challenging problem. Large-scale load distribution strategies like multi-installment scheduling (MIS) assume that the size of the ...
- research-article, December 2022
Automated Scheduling Algorithm Selection and Chunk Parameter Calculation in OpenMP
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4383–4394, https://doi.org/10.1109/TPDS.2022.3189270
Increasing node and cores-per-node counts in supercomputers render scheduling and load balancing critical for exploiting parallelism. OpenMP applications can achieve high performance via careful selection of scheduling `kind` and ...
- research-article, December 2022
Eiffel: Efficient and Fair Scheduling in Adaptive Federated Learning
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4282–4294, https://doi.org/10.1109/TPDS.2022.3187365
Emerging machine learning (ML) technologies, in combination with the increasing computational power of mobile devices, lead to the extensive adoption of ML-based applications. Different from conventional model training that needs to collect all the user ...
- research-article, December 2022
PushBox: Making Use of Every Bit of Time to Accelerate Completion of Data-Parallel Jobs
Chen Tian, Yi Wang, Bingchuan Tian, Yang Zhao, Yuhang Zhou, Chenxu Wang, Haoran Guan, Wanchun Dou, Guihai Chen
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4256–4269, https://doi.org/10.1109/TPDS.2022.3182037
To minimize a job's completion time, we need to minimize the completion time of its final stage's last task. Scheduling of machine slots and networks largely dominates the variable part of each task's duration. Finding an optimal ...
- research-article, December 2022
Energy-Aware Non-Preemptive Task Scheduling With Deadline Constraint in DVFS-Enabled Heterogeneous Clusters
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4083–4099, https://doi.org/10.1109/TPDS.2022.3181096
Energy conservation of large data centers for high performance computing workloads, such as deep learning with Big Data, is of critical significance, where cutting down a few percent of electricity translates into million-dollar savings. This work studies ...
- research-article, December 2022
Real-Time Scheduling of Parallel Task Graphs With Critical Sections Across Different Vertices
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 33, Issue 12, Dec. 2022, Pages 4117–4133, https://doi.org/10.1109/TPDS.2022.3179328
All existing work on real-time scheduling of parallel task graph models with shared resources assumes that a critical section must be contained inside a single vertex. However, this assumption does not hold in many realistic parallel real-time software. ...