research-article

PR-Sketch: Monitoring Per-key Aggregation of Streaming Data with Nearly Full Accuracy

Published: 01 June 2021

Abstract

Computing per-key aggregation is indispensable in streaming data analysis and is formulated as two phases: an update phase and a recovery phase. As the size and speed of data streams rise, accurate per-key information becomes useful in many applications such as anomaly detection, attack prevention, and online diagnosis. Although many algorithms have been proposed for per-key aggregation in stream processing, their accuracy guarantees cover only a small portion of keys. In this paper, we aim to achieve nearly full accuracy with limited resource usage. We follow the line of sketch-based techniques and observe that existing methods suffer from high errors for most keys. The reason is twofold: they track keys with complicated mechanisms in the update phase, and they simply calculate each per-key aggregation from one specific counter in the recovery phase. We therefore present PR-Sketch, a novel sketching design that addresses both limitations. PR-Sketch builds linear equations between counter values and per-key aggregations to improve accuracy, and records keys in the recovery phase to reduce resource usage in the update phase. We also provide an extension called fast PR-Sketch to further improve the processing rate. We derive the space complexity, time complexity, and guaranteed error probability for both PR-Sketch and fast PR-Sketch. We conduct trace-driven experiments with 100K keys and 1M items to compare our algorithms with multiple state-of-the-art methods. The results demonstrate the resource efficiency and nearly full accuracy of our algorithms.
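
The idea of recovering per-key aggregations from linear equations over counters can be illustrated with a minimal sketch. This is not the authors' implementation. The hash family `((a * key + b) % PRIME) % W`, the sizes, the sample stream, and the conjugate-gradient least-squares solver are all assumptions chosen for illustration, and the key set is simply assumed to be known at recovery time (how PR-Sketch itself records keys is not modeled). For contrast, the example also computes a Count-Min style estimate, which reads each key from a single counter.

```python
# Toy model: a few rows of W counters; each key maps to one counter per row.
# All constants below are hypothetical and chosen only to keep the example small.
W = 4                            # counters per row (assumed)
PRIME = 7                        # modulus of the toy hash family (assumed)
HASH_PARAMS = [(1, 0), (2, 1)]   # one (a, b) pair per row (assumed)

def bucket(row, key):
    """Counter index of `key` in `row` under the toy hash family."""
    a, b = HASH_PARAMS[row]
    return ((a * key + b) % PRIME) % W

def update(counters, key, value):
    """Update phase: add `value` to one counter in every row."""
    for row in range(len(HASH_PARAMS)):
        counters[row][bucket(row, key)] += value

def count_min_query(counters, key):
    """Baseline: estimate a key from a single counter (the row minimum)."""
    return min(counters[row][bucket(row, key)] for row in range(len(HASH_PARAMS)))

def build_system(counters, keys):
    """Recovery phase, step 1: one equation per counter.
    A[i][j] = 1 when key j is hashed into counter i; b holds counter values."""
    A = [[0.0] * len(keys) for _ in range(len(HASH_PARAMS) * W)]
    for j, key in enumerate(keys):
        for row in range(len(HASH_PARAMS)):
            A[row * W + bucket(row, key)][j] = 1.0
    b = [float(c) for row_counters in counters for c in row_counters]
    return A, b

def solve_least_squares(A, b, iterations=100, tol=1e-20):
    """Recovery phase, step 2: CGLS, i.e. conjugate gradient applied to the
    normal equations A^T A x = A^T b, starting from x = 0."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    r = b[:]                                                         # b - A x
    s = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]    # A^T r
    p = s[:]
    gamma = sum(v * v for v in s)
    for _ in range(iterations):
        if gamma < tol:          # gradient is numerically zero: converged
            break
        q = [sum(A[i][j] * p[j] for j in range(n)) for i in range(m)]  # A p
        alpha = gamma / sum(v * v for v in q)
        x = [xj + alpha * pj for xj, pj in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        gamma_new = sum(v * v for v in s)
        p = [sj + (gamma_new / gamma) * pj for sj, pj in zip(s, p)]
        gamma = gamma_new
    return x

# Hypothetical stream of (key, value) items; true sums are 5, 1, 3, 2, 7, 4.
stream = [(0, 2), (4, 7), (0, 3), (1, 1), (2, 3), (3, 2), (5, 4)]
keys = sorted({key for key, _ in stream})

counters = [[0] * W for _ in HASH_PARAMS]
for key, value in stream:
    update(counters, key, value)

A, b = build_system(counters, keys)
estimates = solve_least_squares(A, b)
count_min = [count_min_query(counters, key) for key in keys]

print("counters:       ", counters)
print("single-counter: ", count_min)
print("linear system:  ", [round(v, 6) for v in estimates])
```

In this toy instance the single-counter estimate overcounts keys 0 and 5 (8 and 5 instead of 5 and 4) because they share counters with other keys. The eight counter equations have full column rank over the six keys, so the least-squares solution matches the true aggregates up to floating-point error. With more keys than independent equations the system becomes underdetermined, and this simple solver would return only one of many consistent solutions.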

Published In

Proceedings of the VLDB Endowment, Volume 14, Issue 10
June 2021
219 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
