An algebraic approach to rule-based information extraction

F Reiss, S Raghavan, R Krishnamurthy, H Zhu, S Vaithyanathan
2008 IEEE 24th International Conference on Data Engineering, 2008, ieeexplore.ieee.org
Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally, we validate the potential benefits of our approach by extensive experiments over real-world blog data.
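The abstract does not list the operators themselves, so the following is only an illustrative sketch of the general idea, not the authors' algebra or implementation. It assumes two hypothetical operators over character spans: `regex_extract`, which turns regular-expression matches into spans, and a "follows" join that pairs a span with any span starting shortly after it. Because the rule is written as a composition of operators rather than as one grammar, the same join can be evaluated by different physical plans (here a nested loop and a sorted lookup using `bisect`), which is the kind of choice a query optimizer could make. All names, patterns, and the sample text are assumptions made for this example.

```python
import re
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A half-open character range [begin, end) in the input text."""
    begin: int
    end: int


def regex_extract(pattern: str, text: str) -> List[Span]:
    """Hypothetical extraction operator: one span per regex match."""
    return [Span(m.start(), m.end()) for m in re.finditer(pattern, text)]


def follows_join(left: List[Span], right: List[Span],
                 max_gap: int) -> List[Tuple[Span, Span]]:
    """Nested-loop plan: pair (l, r) when r begins 0..max_gap chars after l ends."""
    return [(l, r) for l in left for r in right
            if 0 <= r.begin - l.end <= max_gap]


def follows_join_sorted(left: List[Span], right: List[Span],
                        max_gap: int) -> List[Tuple[Span, Span]]:
    """Alternative plan for the same join: binary search over sorted begins."""
    right = sorted(right)
    begins = [r.begin for r in right]
    pairs = []
    for l in left:
        lo = bisect_left(begins, l.end)
        hi = bisect_right(begins, l.end + max_gap)
        pairs.extend((l, r) for r in right[lo:hi])
    return pairs


def merge(pairs: List[Tuple[Span, Span]]) -> List[Span]:
    """Combine each joined pair into one span covering both operands."""
    return [Span(l.begin, r.end) for l, r in pairs]


# Sample text and patterns are made up for illustration.
text = "Call Alice at 555-1234 or email Bob. Carol: 555-9876."
names = regex_extract(r"\b[A-Z][a-z]+\b", text)
phones = regex_extract(r"\b\d{3}-\d{4}\b", text)

# Rule: a capitalized word followed within 5 characters by a phone number.
merged = merge(follows_join(names, phones, max_gap=5))
results = [text[s.begin:s.end] for s in merged]
print(results)  # ['Alice at 555-1234', 'Carol: 555-9876']
```

Both join functions return the same pairs for this input; they differ only in how much work they do, which is why expressing a rule as operators leaves room for a cost-based choice between plans.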