research-article

Designing an inductive data stream management system: the stream mill experience

Authors:

Barzan Mozafari,

Carlo Zaniolo

SSPS '08: Proceedings of the 2nd international workshop on Scalable stream processing system

Pages 79 - 88

https://doi.org/10.1145/1379272.1379286

Published: 29 March 2008

Abstract

There has been much recent interest in on-line data mining. Existing mining algorithms designed for stored data are either not applicable or not effective on data streams, where real-time response is often needed and data characteristics change frequently. Researchers have therefore focused on designing new and improved algorithms for on-line mining tasks such as classification, clustering, frequent itemset mining, and pattern matching. Relatively little attention has been paid to designing data stream management systems (DSMSs) that facilitate and integrate the task of mining data streams, i.e., stream systems that provide Inductive functionalities analogous to those provided by Weka and MS OLE DB for stored data. In this paper, we propose the notion of an Inductive DSMS: a system that, besides providing a rich library of inter-operable functions to support the whole mining process, also supports the essentials of a DSMS, including optimization of continuous queries, load shedding, synoptic constructs, and non-stop computing. Ease of use and extensibility are additional desiderata for the proposed Inductive DSMS. We first review the many challenges involved in realizing such a system and then present our approach of extending the Stream Mill DSMS toward that goal. Our system features (i) a powerful query language where mining methods are expressed via aggregates for generic streams and arbitrary windows, (ii) a library of fast and light mining algorithms, and (iii) an architecture that makes it easy to customize and extend existing mining methods and introduce new ones.
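As an illustration of feature (i), the following minimal Python sketch shows how a mining task might be phrased as an aggregate over a sliding window. It is not Stream Mill code: the class `FrequentItemsAggregate`, its methods `iterate`, `expire`, and `result`, and the driver `run_continuous_query` are hypothetical names chosen for this example, and the window is assumed to be count-based with exact counting.

```python
from collections import Counter, deque


class FrequentItemsAggregate:
    """Hypothetical windowed aggregate (illustrative, not the Stream Mill API).

    Keeps exact item counts over the last `window_size` tuples and reports
    the items whose count reaches `min_support` times the current window length.
    """

    def __init__(self, window_size, min_support):
        self.window_size = window_size  # number of most recent tuples retained
        self.min_support = min_support  # minimum fraction of the window
        self.window = deque()           # tuples currently inside the window
        self.counts = Counter()         # per-item counts for the window

    def iterate(self, item):
        """Absorb a newly arrived tuple; expire the oldest one if the window overflows."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.window_size:
            self.expire(self.window.popleft())

    def expire(self, item):
        """Undo the contribution of a tuple that has left the window."""
        self.counts[item] -= 1
        if self.counts[item] == 0:
            del self.counts[item]

    def result(self):
        """Return the currently frequent items in sorted order."""
        threshold = self.min_support * len(self.window)
        return sorted(i for i, c in self.counts.items() if c >= threshold)


def run_continuous_query(stream, aggregate):
    """Feed the stream to the aggregate and yield its result after every tuple."""
    for item in stream:
        aggregate.iterate(item)
        yield aggregate.result()


if __name__ == "__main__":
    agg = FrequentItemsAggregate(window_size=4, min_support=0.5)
    for frequent in run_continuous_query("abacabbb", agg):
        print(frequent)
```

Keeping the per-tuple logic in `iterate` and `expire` means the same aggregate could, in principle, be reused with a different window size or support threshold without touching the query driver, which is the kind of separation the abstract's extensibility goal suggests.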

Published In

SSPS '08: Proceedings of the 2nd international workshop on Scalable stream processing system

March 2008

99 pages

ISBN:9781595939630

DOI:10.1145/1379272

Conference Chair:
Byung S. Lee
University of Vermont

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

EDBT '08

EDBT '08: 11th International Conference on Extending Database Technology

March 29, 2008

Nantes, France
