Efficient Substructure Discovery from Large Semi-Structured Data

Tatsuya ASAI
Kenji ABE
Shinji KAWASOE
Hiroshi SAKAMOTO
Hiroki ARIMURA
Setsuo ARIKAWA

Publication
IEICE TRANSACTIONS on Information and Systems, Vol.E87-D, No.12, pp.2754-2763
Publication Date: 2004/12/01
Print ISSN: 0916-8532
Type of Manuscript: PAPER
Category: Data Mining
Keyword: Web mining, semi-structured data, association rule mining, itemset enumeration tree, labeled ordered trees, data mining algorithms




Summary: 
In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures in a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of the maximal tree patterns contained in the input collection, depending only mildly on the size of the longest pattern. We also develop several pruning techniques that significantly speed up the search. Experiments on Web data show that, combined with the proposed pruning techniques, our algorithm runs efficiently on real-life datasets over a wide range of parameters.
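
To make the problem setting concrete, the following is a minimal sketch of what it can mean for a tree pattern to be frequent in a collection of labeled ordered trees. It is a naive brute-force check, not the algorithm of the paper, and it uses neither the enumeration technique nor the pruning techniques mentioned in the summary. The matching rule (labels, parent-child relation, and left-to-right sibling order are preserved), the per-tree definition of support, and all names (`matches_at`, `contains`, `support`) are assumptions made for illustration only.

```python
from typing import List, Tuple

# A labeled ordered tree as (label, [children]); the order of children matters.
# This representation is an assumption for the sketch, not taken from the paper.
Tree = Tuple[str, list]


def matches_at(pattern: Tree, node: Tree) -> bool:
    """Return True if `pattern` can be embedded with its root mapped to `node`.

    Assumed matching rule: labels must agree, each pattern child maps to a
    distinct child of the data node, and left-to-right order is preserved.
    """
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    # Greedy leftmost matching of pattern children onto an order-preserving
    # subsequence of the data node's children.
    i = 0
    for pc in p_children:
        while i < len(n_children) and not matches_at(pc, n_children[i]):
            i += 1
        if i == len(n_children):
            return False
        i += 1
    return True


def contains(pattern: Tree, tree: Tree) -> bool:
    """Return True if `pattern` matches at some node of `tree`."""
    if matches_at(pattern, tree):
        return True
    return any(contains(pattern, child) for child in tree[1])


def support(pattern: Tree, collection: List[Tree]) -> float:
    """Fraction of trees in `collection` that contain `pattern` (assumed definition)."""
    return sum(contains(pattern, t) for t in collection) / len(collection)


# Hypothetical toy collection of three small document trees.
docs = [
    ("html", [("body", [("p", []), ("a", [])])]),
    ("html", [("body", [("a", []), ("p", [])])]),
    ("html", [("head", []), ("body", [("p", []), ("img", []), ("a", [])])]),
]
pattern = ("body", [("p", []), ("a", [])])
```

Under these assumptions, `pattern` is contained in the first and third trees but not in the second, where the sibling order is reversed, so a minimum-support threshold of one half would report it as frequent. A practical miner would avoid testing candidate patterns one by one in this way, which is the cost that the enumeration and pruning techniques of the paper address.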

