2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015
Graph analysis unveils hidden associations in many phenomena and artifacts, such as road networks, social networks, genomic information, and scientific collaboration. Unfortunately, the wide diversity in the characteristics of graphs and graph operations makes it challenging to find the right combination of tools and algorithm implementations to discover the desired knowledge from a target data set. This study presents an extensive empirical study of three representative graph processing platforms: Pegasus, GraphX, and Urika. Each system represents a different combination of options in data model, processing paradigm, and infrastructure. We benchmarked each platform using three popular graph operations, degree distribution, connected components, and PageRank, over a variety of real-world graphs. Our experiments show that each graph processing platform has advantages for different sets of operations on graphs. While Urika performs best at computing statistical metrics of a graph, such as the degree distribution, GraphX outperforms the others on algorithmic operations such as connected components and PageRank. In addition, we discuss challenges in optimizing the performance of each platform.
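The two operation families the benchmark contrasts can be illustrated with a minimal sketch: a statistical metric (degree distribution) and an algorithmic operation (iterative PageRank) over a plain edge list. This is an illustrative toy, not the implementation used on Pegasus, GraphX, or Urika.

```python
from collections import Counter, defaultdict

def degree_distribution(edges):
    """Count how many vertices have each out-degree."""
    deg = Counter(src for src, _ in edges)
    return Counter(deg.values())

def pagerank(edges, damping=0.85, iters=50):
    """Iterative PageRank over a directed edge list."""
    out = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out[src].append(dst)
        nodes.update((src, dst))
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out.get(v)
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # dangling node: spread its rank uniformly
                share = damping * rank[v] / n
                for t in nodes:
                    nxt[t] += share
        rank = nxt
    return rank
```

On a 3-cycle such as `[(1, 2), (2, 3), (3, 1)]`, every vertex has out-degree 1 and the ranks converge to 1/3 each, which is a convenient sanity check for any of the platforms.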
Proceedings of the 3rd Workshop on Social Network Mining and Analysis, 2009
In a blog network, the posts in a blog can be diffused to other blogs through trackbacks and scraps. Analyzing information diffusion in a blog network is an important research issue that can be applied to predicting information diffusion, detecting abnormalities, marketing, and revitalizing the blog world. Existing studies on information diffusion in a blog network define explicit relationships between
This study presents a reliability analysis of virtual machine instances in public cloud environments in the face of dynamic pricing. Unlike traditional fixed pricing, dynamic pricing allows the price to fluctuate over arbitrary periods of time according to external factors such as supply and demand or excess capacity. This pricing option introduces a new type of fault: virtual machine instances may be unexpectedly terminated due to a conflict between the original bid price and the currently offered price. This new class of fault under dynamic pricing may be more dominant than traditional faults in cloud computing environments, where resource availability with respect to traditional faults is often above 99.9%. To address and understand this new type of fault, we translated two classic reliability metrics, mean time between failures (MTBF) and availability, to the Amazon Web Services spot market using historical price data. We also validated our findings by submitting actual bids in the spot market and found that, overall, our historical analysis and experimental validation lined up well. Based upon these experimental results, we also provide suggestions and techniques to maximize the overall reliability of virtual machine instances under dynamic pricing.
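The translation of MTBF and availability to a spot market can be sketched as follows. This is a minimal model under assumptions of my own, not the paper's exact method: the price history is a list of (timestamp in hours, price) samples, each price holds until the next sample, and an instance is terminated whenever the spot price exceeds the bid.

```python
def spot_reliability(price_history, bid):
    """
    Estimate (MTBF, availability) for a spot instance at a given bid.
    price_history: list of (timestamp_hours, price) sorted by time; each
    price is assumed to hold until the next timestamp. The instance is
    "up" while price <= bid; each up-to-down transition is one failure.
    """
    uptime = 0.0
    total = 0.0
    failures = 0
    was_up = False
    for (t0, p), (t1, _) in zip(price_history, price_history[1:]):
        span = t1 - t0
        total += span
        up = p <= bid
        if up:
            uptime += span
        if was_up and not up:
            failures += 1
        was_up = up
    availability = uptime / total if total else 0.0
    mtbf = uptime / failures if failures else float("inf")
    return mtbf, availability
```

For example, with a bid of 0.20 against a history that spikes to 0.30 for two hours out of ten, the model reports 80% availability and one failure; raising the bid above every observed price drives the estimated MTBF to infinity, which matches the intuition that a sufficiently high bid behaves like an on-demand instance.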
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013
In this paper, we present a quantitative performance analysis of data analytics applications running on multi-core virtual machines. Such environments form the core of cloud computing. In addition, data analytics applications such as Cassandra and Hadoop are becoming increasingly popular on cloud computing platforms, and this convergence necessitates a better understanding of the performance and cost implications of such hybrid systems. For example, the very first step in hosting an application in a virtualized environment requires the user to configure the number of virtual processors and the size of memory. To understand the performance implications of this step, we benchmarked three Yahoo Cloud Serving Benchmark (YCSB) workloads in a virtualized multi-core environment. Our measurements indicate that the performance of Cassandra on the YCSB workloads does not depend heavily on the processing capacity of the system, while the size of the data set relative to allocated memory is critical to performance. We also identified a strong relationship between workload running time and various hardware events (last-level cache loads, cache misses, and CPU migrations). From this analysis, we provide several suggestions for improving the performance of data analytics applications running in cloud computing environments.
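The "strong relationship between running time and hardware events" is the kind of claim one can check with a simple correlation over per-run measurements. A minimal sketch, with entirely hypothetical numbers (the paper's actual measurements are not reproduced here):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-run measurements: running time (s) vs. last-level
# cache misses collected alongside each benchmark run.
runtimes = [120.0, 135.0, 150.0, 171.0]
llc_misses = [1.1e9, 1.3e9, 1.5e9, 1.8e9]
r = pearson(runtimes, llc_misses)
```

A coefficient near 1.0 would support the reported relationship; in practice the event counts would come from a hardware-counter tool such as Linux perf rather than being entered by hand.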
Time-series subsequence matching is an operation that searches a time-series database for data subsequences whose changing patterns are similar to a query sequence. This paper addresses a performance issue in time-series subsequence matching. First, we quantitatively examine the performance degradation caused by the window size effect, and show that the performance of subsequence matching with a single index is not satisfactory in real applications. We claim that index interpolation is an effective tool to resolve this problem. Index interpolation performs subsequence matching by selecting the most appropriate index from multiple indexes built on windows of distinct sizes. For index interpolation, we need to decide the window sizes for which the multiple indexes are built. In this paper, we solve the problem of selecting optimal window sizes from the perspective of physical database design. Given a set of ⟨length, frequency⟩ pairs of query sequences to be issued in a target application and a set of window sizes for building multiple indexes, we devise a formula that estimates the overall cost of all the subsequence matchings. Using this formula, we propose an algorithm that determines the optimal window sizes for maximizing the performance of the entire set of subsequence matchings. We formally prove the optimality as well as the effectiveness of the algorithm. Finally, we perform a series of experiments with a real-life stock data set and a large synthetic data set to show the superiority of our approach.
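The selection problem the abstract describes can be sketched with a toy version of the idea: score each candidate set of window sizes against the ⟨length, frequency⟩ query mix and keep the cheapest. The cost model below is an assumption for illustration only (it simply charges the number of sliding windows a query contains, capturing the window size effect); the paper derives its own cost formula and a provably optimal algorithm rather than this brute-force search.

```python
from itertools import combinations

def estimated_cost(query_len, window):
    """
    Toy cost model (an assumption, not the paper's formula): matching
    gets more expensive as the index window shrinks relative to the
    query, so charge the number of sliding windows in the query.
    """
    return query_len - window + 1

def best_window_sizes(query_mix, candidates, k):
    """
    Pick k window sizes from `candidates` minimizing total cost over a
    query mix of (length, frequency) pairs. Each query uses the largest
    chosen window that does not exceed its length.
    """
    best, best_cost = None, float("inf")
    for chosen in combinations(sorted(candidates), k):
        cost = 0.0
        feasible = True
        for length, freq in query_mix:
            usable = [w for w in chosen if w <= length]
            if not usable:
                feasible = False
                break
            cost += freq * estimated_cost(length, max(usable))
        if feasible and cost < best_cost:
            best, best_cost = chosen, cost
    return best, best_cost
```

For a mix of many short queries and a few long ones, the search picks one window near each common query length, mirroring the intuition behind building multiple indexes instead of one.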
Papers by Rod Lim