DOI: 10.5555/1182635.1164180
Article

On biased reservoir sampling in the presence of stream evolution

Published: 01 September 2006

Abstract

Reservoir-based sampling is often used to pick an unbiased sample from a data stream. As the stream evolves, however, a large portion of an unbiased sample may become less relevant over time. An analytical or mining task (e.g., query estimation) that is specific to the sample points from a recent time horizon may then provide a very inaccurate result, because the size of the relevant sample shrinks with the horizon itself. Yet this is precisely the most important case for data stream algorithms, since recent history is frequently analyzed. In such cases, we show that an effective solution is to bias the sample with the use of temporal bias functions. The maintenance of such a sample is non-trivial, since it must be maintained dynamically, without knowing the total number of points in advance. We prove some interesting theoretical properties of a large class of memory-less bias functions, which allow for an efficient implementation of the sampling algorithm. We also show that the inclusion of bias in the sampling process introduces a maximum requirement on the reservoir size. This is a useful property, since it shows that it may often be possible to maintain the maximum relevant sample with limited storage requirements. We not only illustrate the advantages of the method for the problem of query estimation, but also show that the approach is applicable to broader data mining problems such as evolution analysis and classification.
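
The abstract does not spell out the maintenance procedure. As a minimal sketch only, assuming an exponential (memory-less) bias function with a hypothetical rate parameter `lam` and an assumed reservoir cap of roughly 1/`lam` points, one replacement policy that keeps the reservoir bounded without knowing the stream length could look like this in Python. The function name `biased_reservoir` and the `seed` parameter are illustrative and do not come from the paper.

```python
import random


def biased_reservoir(stream, lam, seed=None):
    """Keep a bounded sample biased toward recent points (illustrative sketch).

    Assumptions (not taken from the abstract): the bias is exponential with
    rate `lam`, and the reservoir is capped at about 1 / lam points.
    """
    rng = random.Random(seed)
    capacity = max(1, int(1 / lam))  # assumed cap on the reservoir size
    reservoir = []
    for point in stream:
        # Fraction of the reservoir currently occupied, in [0, 1].
        fill_fraction = len(reservoir) / capacity
        if rng.random() < fill_fraction:
            # Overwrite a uniformly chosen resident point with the new arrival.
            reservoir[rng.randrange(len(reservoir))] = point
        else:
            # Grow the reservoir; no resident point is evicted.
            reservoir.append(point)
    return reservoir


if __name__ == "__main__":
    sample = biased_reservoir(range(10_000), lam=0.01, seed=0)
    print(len(sample), min(sample), max(sample))
```

Every arriving point enters the reservoir in this sketch. Once the reservoir is full, each arrival evicts any given resident with probability 1/capacity, so a point survives k further arrivals with probability (1 - 1/capacity)^k, which is approximately e^(-lam * k) when capacity is 1/lam. The retained sample therefore skews toward recent history while its size never exceeds the cap.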

Published In

VLDB '06: Proceedings of the 32nd international conference on Very large data bases
September 2006
1269 pages

Sponsors

  • SIGMOD: ACM Special Interest Group on Management of Data
  • K.I.S.S. SIG on Databases
  • AJU Information Technology Co., Ltd
  • US Army ITC-PAC Asian Research Office
  • Google Inc.
  • The Database Society of Japan
  • Samsung SOS
  • Advanced Information Technology Research Center
  • Naver
  • Microsoft
  • Korea Information Science Society
  • SK telecom
  • Systems Applications Products
  • ORACLE
  • International Business Management
  • Air Force Office of Scientific Research/Asian Office of Aerospace R&D
  • Kosef
  • Kaist
  • LG Electronics
  • CCF-DBS

Publisher

VLDB Endowment
