Efficient mining of maximal sequential patterns using multiple samples

C Luo, SM Chung - Proceedings of the 2005 SIAM International …, 2005 - SIAM
C Luo, SM Chung
Proceedings of the 2005 SIAM International Conference on Data Mining, 2005SIAM
In this paper, we propose a new algorithm, named MSPX, which mines maximal sequential
patterns by using multiple samples to effectively exclude infrequent candidates. MSPX
begins with a bottom-up search. But at each pass, instead of processing all candidates, it
always tries to find most of the infrequent ones effectively by counting only the potentially
infrequent candidates against the whole database. After removing verified infrequent
candidates, the remaining candidates are used to generate new candidates. Finally, with a …
Abstract
In this paper, we propose a new algorithm, named MSPX, which mines maximal sequential patterns by using multiple samples to effectively exclude infrequent candidates. MSPX begins with a bottom-up search. But at each pass, instead of processing all candidates, it always tries to find most of the infrequent ones effectively by counting only the potentially infrequent candidates against the whole database. After removing verified infrequent candidates, the remaining candidates are used to generate new candidates. Finally, with a top-down search, all the maximal frequent sequences can be identified efficiently. Sampling technique is used at each pass to distinguish potentially infrequent candidates. How to increase the minimum support level for the mining of samples to estimate if candidates could be infrequent is analyzed theoretically. Due to the supersequence frequence based pruning, MSPX reduces much more search space than other algorithms. Unlike the traditional single-sample methods proposed for mining frequent itemsets, MSPX uses multiple samples. Thus, it can avoid or alleviate some problems inherent in the single-sample methods. Our experiments show MSPX has very good performance and better scalability than other algorithms. Moreover, even though MSPX uses sampling, the variance of its performance is very small in multiple runs for the same task.
Society for Industrial and Applied Mathematics