FlexSketch: Estimation of Probability Density for Stationary and Non-Stationary Data Streams
Abstract
1. Introduction
- We propose a new method to estimate the probability distribution of data streams with concept drift.
- FlexSketch decouples adapting to concept drift from adjusting the statistical model for stationary data by using two separate operations.
- FlexSketch achieves low computational overhead and high throughput, both of which are critical for processing stream data, by using an ensemble of compact histograms.
2. Related Work
3. Proposed Method
- (a) We choose a histogram as the statistical model. When there are only minor changes in the data stream, a histogram is a suitable model because it can be updated at high speed.
- (b) Our method uses an ensemble data structure consisting of several histograms. The ensemble structure can compensate for the inaccuracy of a single histogram.
- (c) We design two adaptation techniques. If the data stream is stationary or changes only slightly, FlexSketch updates the existing models, i.e., the histograms. If there are major changes in the data stream, however, updating the models may not guarantee sufficient accuracy. To address this issue, FlexSketch generates a new model that represents the changed data stream and adds it to the data structure.
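The design described in (a) to (c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the equal-width binning, the equal-weight averaging in `pdf`, and the fixed `build_every` trigger for adding a new histogram are all assumptions made for illustration, since the text above does not state how a major change is recognized.

```python
from typing import List


class Histogram:
    """Equal-width histogram over [lo, hi); an illustrative stand-in model."""

    def __init__(self, lo: float, hi: float, bins: int) -> None:
        self.lo, self.hi, self.bins = lo, hi, bins
        self.width = (hi - lo) / bins
        self.counts = [0.0] * bins
        self.total = 0.0

    def _index(self, x: float) -> int:
        # Clamp so that out-of-range values fall into the edge bins.
        i = int((x - self.lo) / self.width)
        return min(max(i, 0), self.bins - 1)

    def add(self, x: float) -> None:
        self.counts[self._index(x)] += 1.0
        self.total += 1.0

    def pdf(self, x: float) -> float:
        if self.total == 0 or not (self.lo <= x < self.hi):
            return 0.0
        return self.counts[self._index(x)] / (self.total * self.width)


class HistogramEnsemble:
    """Hypothetical ensemble with a cheap update path and a build path."""

    def __init__(self, lo: float, hi: float, bins: int = 20,
                 max_models: int = 3, build_every: int = 500) -> None:
        self.lo, self.hi, self.bins = lo, hi, bins
        self.max_models = max_models      # assumption: bounded ensemble size
        self.build_every = build_every    # assumption: periodic build trigger
        self.models: List[Histogram] = [Histogram(lo, hi, bins)]
        self.buffer: List[float] = []

    def update(self, x: float) -> None:
        # Update path: adjust the newest histogram in place.
        self.models[0].add(x)
        self.buffer.append(x)
        if len(self.buffer) >= self.build_every:
            self.build()

    def build(self) -> None:
        # Build path: create a new histogram from the recent samples,
        # put it in front, and evict the oldest models beyond the limit.
        new = Histogram(self.lo, self.hi, self.bins)
        for x in self.buffer:
            new.add(x)
        self.models.insert(0, new)
        del self.models[self.max_models:]
        self.buffer.clear()

    def pdf(self, x: float) -> float:
        # Query: equal-weight average over the non-empty histograms.
        live = [m for m in self.models if m.total > 0]
        if not live:
            return 0.0
        return sum(m.pdf(x) for m in live) / len(live)
```

Because old histograms are evicted as new ones are built, the estimate in this sketch eventually depends only on recent data, while each individual update remains a constant-time bin increment.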
3.1. Update Operation
Algorithm 1: Update operation.
Algorithm 2: Build operation.
3.2. Query Operation
4. Experiments
4.1. Datasets
- (a) Sudden concept drift is the case where the distribution of the data stream changes abruptly. It tests how well a density estimation algorithm forgets old concepts after concept drift occurs. The underlying distribution is a normal distribution whose mean takes one value before the drift and a different value after it, as shown in Figure 3a.
- (b) Incremental concept drift is the case where the distribution of the data stream changes gradually. It tests how well a density estimation algorithm adapts to the latest concept. The underlying distribution is a normal distribution whose mean is constant before the drift begins and then moves at a constant speed, as shown in Figure 3b.
- (c) Blip concept drift is the case where the distribution of the data stream suddenly changes and then returns to its original state after a short time. It tests how stable the estimated PDF remains when an outlier occurs. The underlying distribution is a normal distribution whose mean changes suddenly, holds the new value for the duration of the blip, and then returns to its original value, as shown in Figure 3c.
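The three synthetic streams can be sketched in Python as follows. This is an illustration only: the function names, the drift times, the means, and the standard deviation are hypothetical placeholders, not the parameter values used in the experiments, and the choice of which side of the drift time is inclusive is likewise an assumption.

```python
import random
from typing import Callable, List


def sudden_mean(t: int, t_drift: int, mu0: float, mu1: float) -> float:
    """Mean jumps from mu0 to mu1 at t_drift and stays there."""
    return mu0 if t < t_drift else mu1


def incremental_mean(t: int, t_start: int, mu0: float, velocity: float) -> float:
    """Mean is mu0 until t_start, then moves at a constant speed."""
    return mu0 if t < t_start else mu0 + velocity * (t - t_start)


def blip_mean(t: int, t_start: int, duration: int,
              mu0: float, mu1: float) -> float:
    """Mean is mu1 only during [t_start, t_start + duration), else mu0."""
    return mu1 if t_start <= t < t_start + duration else mu0


def sample_stream(mean_fn: Callable[[int], float], n: int,
                  sigma: float = 1.0, seed: int = 0) -> List[float]:
    """Draw one normal sample per time step with a time-varying mean."""
    rng = random.Random(seed)
    return [rng.gauss(mean_fn(t), sigma) for t in range(n)]


# Example usage with placeholder parameters (assumptions, not the paper's).
sudden = sample_stream(lambda t: sudden_mean(t, 300, 0.0, 5.0), 1000)
incremental = sample_stream(lambda t: incremental_mean(t, 300, 0.0, 0.01), 1000)
blip = sample_stream(lambda t: blip_mean(t, 300, 50, 0.0, 5.0), 1000)
```

Separating the mean schedule from the sampler keeps each drift type a one-line function, so other drift shapes can be tested by passing a different `mean_fn`.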
4.2. Implementation
4.3. Performance Metrics
- (a) Half-life: To measure the adaptability of an algorithm under sudden concept drift, we measure the time taken for the error at the time of concept drift to be reduced by half, which we call the half-life. This metric measures how quickly an old concept is forgotten in the short term.
- (b) Lifetime: Similarly, we quantify how long the contribution of past data persists or, equivalently, how quickly the old concept is forgotten in the long term. The lifetime is defined as the time required for a long-lived term in (11) to decay to a fixed fraction of its initial value.
- (c) Lag: The lag measures how well the estimated PDF adapts to the data stream under incremental concept drift. It is defined as the absolute ratio of two quantities, one of which is a derivative evaluated at a fixed point. If an algorithm does not adapt well to the concept drift, the accumulated error makes it lag further and further behind.
- (d) Instability: The instability measures how fast the estimated PDF moves over a short duration when blip concept drift occurs. It is defined as the velocity of the error, which is approximated in practice.
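Two of these metrics can be computed directly from a recorded error trace, as the following Python sketch suggests. It is a hedged illustration rather than the paper's exact formulas: `half_life` follows the verbal definition in (a), while `error_velocity` approximates the "velocity of the error" in (d) with a simple finite difference, which is an assumption. Both function names and the per-step error list are hypothetical.

```python
from typing import Optional, Sequence


def half_life(errors: Sequence[float], t_drift: int) -> Optional[int]:
    """Steps after t_drift until the error falls to half its value at t_drift.

    Returns None if the error never drops that far within the trace.
    """
    threshold = errors[t_drift] / 2.0
    for t in range(t_drift, len(errors)):
        if errors[t] <= threshold:
            return t - t_drift
    return None


def error_velocity(errors: Sequence[float], t: int, dt: int = 1) -> float:
    """Finite-difference estimate of how fast the error changes at step t."""
    return (errors[t + dt] - errors[t]) / dt


# Example: the error jumps to 1.0 at step 5 (the drift) and then decays.
trace = [0.1, 0.1, 0.1, 0.1, 0.1, 1.0, 0.8, 0.6, 0.5, 0.4, 0.2]
steps_to_half = half_life(trace, t_drift=5)
velocity_after_drift = error_velocity(trace, t=5, dt=2)
```

A shorter half-life indicates faster forgetting of the old concept, and a velocity of smaller magnitude around a blip indicates a more stable estimate.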
4.4. Throughput
4.5. Accuracy (Error and Adaptability)
4.5.1. Stationary Case
4.5.2. Non-Stationary Case
4.6. Memory Usage
4.7. Effects of Parameters
4.7.1. Stationary Case
4.7.2. Non-Stationary Case
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Park, N.; Kim, S. FlexSketch: Estimation of Probability Density for Stationary and Non-Stationary Data Streams. Sensors 2021, 21, 1080. https://doi.org/10.3390/s21041080