2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015
Graph analysis unveils hidden associations in many phenomena and artifacts, such as road networks, social networks, genomic information, and scientific collaboration. Unfortunately, the wide diversity in the characteristics of graphs and graph operations makes it challenging to find the right combination of tools and algorithm implementations to discover the desired knowledge from a target data set. This study presents an extensive empirical study of three representative graph processing platforms: Pegasus, GraphX, and Urika. Each system represents a different combination of options in data model, processing paradigm, and infrastructure. We benchmarked each platform using three popular graph operations, degree distribution, connected components, and PageRank, over a variety of real-world graphs. Our experiments show that each graph processing platform has advantages for different sets of operations on graphs. While Urika performs best at computing statistical metrics of a graph, such as the degree distribution, GraphX outperforms the others on algorithmic operations such as connected components and PageRank. In addition, we discuss challenges in optimizing the performance of each platform.
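The two operation families the benchmark contrasts can be illustrated with a minimal sketch: a statistical metric (degree distribution) and an algorithmic operation (iterative PageRank) over a plain edge list. This is an illustrative toy, not the implementation used on Pegasus, GraphX, or Urika.

```python
from collections import Counter, defaultdict

def degree_distribution(edges):
    """Count how many vertices have each out-degree."""
    deg = Counter(src for src, _ in edges)
    return Counter(deg.values())

def pagerank(edges, damping=0.85, iters=50):
    """Iterative PageRank over a directed edge list."""
    out = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out[src].append(dst)
        nodes.update((src, dst))
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out.get(v)
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # dangling node: spread its rank uniformly
                share = damping * rank[v] / n
                for t in nodes:
                    nxt[t] += share
        rank = nxt
    return rank
```

On a 3-cycle such as `[(1, 2), (2, 3), (3, 1)]`, every vertex has out-degree 1 and the ranks converge to 1/3 each, which is a convenient sanity check for any of the platforms.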
Proceedings of the 3rd Workshop on Social Network Mining and Analysis, 2009
In a blog network, the posts in a blog can be diffused to other blogs through trackbacks and scraps. Analyzing information diffusion in a blog network is an important research issue that can be applied to predicting information diffusion, detecting abnormalities, marketing, and revitalizing the blog world. Existing studies on information diffusion in a blog network define explicit relationships between
This study presents a reliability analysis of virtual machine instances in public cloud environments in the face of dynamic pricing. Unlike traditional fixed pricing, dynamic pricing allows the price to fluctuate over arbitrary periods of time according to external factors such as supply and demand or excess capacity. This pricing option introduces a new type of fault: virtual machine instances may be unexpectedly terminated due to a conflict between the original bid price and the currently offered price. This new class of fault under dynamic pricing may be more dominant than traditional faults in cloud computing environments, where resource availability with respect to traditional faults is often above 99.9%. To address and understand this new type of fault, we translated two classic reliability metrics, mean time between failures (MTBF) and availability, to the Amazon Web Services spot market using historical price data. We also validated our findings by submitting actual bids in the spot market and found that, overall, our historical analysis and experimental validation lined up well. Based upon these experimental results, we also provide suggestions and techniques to maximize the overall reliability of virtual machine instances under dynamic pricing.
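The translation of MTBF and availability to a spot market can be sketched as follows. This is a minimal model under assumptions of my own, not the paper's exact method: the price history is a list of (timestamp in hours, price) samples, each price holds until the next sample, and an instance is terminated whenever the spot price exceeds the bid.

```python
def spot_reliability(price_history, bid):
    """
    Estimate (MTBF, availability) for a spot instance at a given bid.
    price_history: list of (timestamp_hours, price) sorted by time; each
    price is assumed to hold until the next timestamp. The instance is
    "up" while price <= bid; each up-to-down transition is one failure.
    """
    uptime = 0.0
    total = 0.0
    failures = 0
    was_up = False
    for (t0, p), (t1, _) in zip(price_history, price_history[1:]):
        span = t1 - t0
        total += span
        up = p <= bid
        if up:
            uptime += span
        if was_up and not up:
            failures += 1
        was_up = up
    availability = uptime / total if total else 0.0
    mtbf = uptime / failures if failures else float("inf")
    return mtbf, availability
```

For example, with a bid of 0.20 against a history that spikes to 0.30 for two hours out of ten, the model reports 80% availability and one failure; raising the bid above every observed price drives the estimated MTBF to infinity, which matches the intuition that a sufficiently high bid behaves like an on-demand instance.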
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013
In this paper, we present a quantitative performance analysis of data analytics applications running on multi-core virtual machines. Such environments form the core of cloud computing. In addition, data analytics applications such as Cassandra and Hadoop are becoming increasingly popular on cloud computing platforms, and this convergence necessitates a better understanding of the performance and cost implications of such hybrid systems. For example, the very first step in hosting an application in a virtualized environment requires the user to configure the number of virtual processors and the size of memory. To understand the performance implications of this step, we benchmarked three Yahoo Cloud Serving Benchmark (YCSB) workloads in a virtualized multi-core environment. Our measurements indicate that the performance of Cassandra on the YCSB workloads does not depend heavily on the processing capacity of the system, while the size of the data set relative to allocated memory is critical to performance. We also identified a strong relationship between workload running time and various hardware events (last-level cache loads, cache misses, and CPU migrations). From this analysis, we provide several suggestions for improving the performance of data analytics applications running in cloud computing environments.
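The "strong relationship between running time and hardware events" is the kind of claim one can check with a simple correlation over per-run measurements. A minimal sketch, with entirely hypothetical numbers (the paper's actual measurements are not reproduced here):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-run measurements: running time (s) vs. last-level
# cache misses collected alongside each benchmark run.
runtimes = [120.0, 135.0, 150.0, 171.0]
llc_misses = [1.1e9, 1.3e9, 1.5e9, 1.8e9]
r = pearson(runtimes, llc_misses)
```

A coefficient near 1.0 would support the reported relationship; in practice the event counts would come from a hardware-counter tool such as Linux perf rather than being entered by hand.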
Time-series subsequence matching is an operation that searches a time-series database for data subsequences whose changing patterns are similar to a query sequence. This paper addresses a performance issue in time-series subsequence matching. First, we quantitatively examine the performance degradation caused by the window size effect, and show that the performance of subsequence matching with a single index is not satisfactory in real applications. We claim that index interpolation is an effective tool to resolve this problem. Index interpolation performs subsequence matching by selecting the most appropriate index from multiple indexes built on windows of distinct sizes. For index interpolation, we need to decide the window sizes for which the multiple indexes are built. In this paper, we solve the problem of selecting optimal window sizes from the perspective of physical database design. Given a set of ⟨length, frequency⟩ pairs of query sequences to be issued in a target application and a set of window sizes for building multiple indexes, we devise a formula that estimates the overall cost of all the subsequence matchings. Using this formula, we propose an algorithm that determines the optimal window sizes for maximizing the performance of the entire set of subsequence matchings. We formally prove the optimality as well as the effectiveness of the algorithm. Finally, we perform a series of experiments with a real-life stock data set and a large synthetic data set to show the superiority of our approach.
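The selection problem the abstract describes can be sketched with a toy version of the idea: score each candidate set of window sizes against the ⟨length, frequency⟩ query mix and keep the cheapest. The cost model below is an assumption for illustration only (it simply charges the number of sliding windows a query contains, capturing the window size effect); the paper derives its own cost formula and a provably optimal algorithm rather than this brute-force search.

```python
from itertools import combinations

def estimated_cost(query_len, window):
    """
    Toy cost model (an assumption, not the paper's formula): matching
    gets more expensive as the index window shrinks relative to the
    query, so charge the number of sliding windows in the query.
    """
    return query_len - window + 1

def best_window_sizes(query_mix, candidates, k):
    """
    Pick k window sizes from `candidates` minimizing total cost over a
    query mix of (length, frequency) pairs. Each query uses the largest
    chosen window that does not exceed its length.
    """
    best, best_cost = None, float("inf")
    for chosen in combinations(sorted(candidates), k):
        cost = 0.0
        feasible = True
        for length, freq in query_mix:
            usable = [w for w in chosen if w <= length]
            if not usable:
                feasible = False
                break
            cost += freq * estimated_cost(length, max(usable))
        if feasible and cost < best_cost:
            best, best_cost = chosen, cost
    return best, best_cost
```

For a mix of many short queries and a few long ones, the search picks one window near each common query length, mirroring the intuition behind building multiple indexes instead of one.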
Papers by Rod Lim