Models and issues in data stream systems

Published: 03 June 2002
DOI: 10.1145/543613.543615

Abstract

In this overview paper, we motivate the need for, and the research issues arising from, a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

Published In

PODS '02: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2002
311 pages
ISBN:1581135076
DOI:10.1145/543613

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS02

Acceptance Rates

PODS '02 Paper Acceptance Rate 24 of 109 submissions, 22%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%
