On sampling from massive graph streams

Published: 01 August 2017

Abstract

We propose Graph Priority Sampling (GPS), a new paradigm for order-based reservoir sampling from massive graph streams. GPS provides a general way to weight edge sampling according to auxiliary and/or size variables so as to accomplish various estimation goals for graph properties. In the context of subgraph counting, we show how edge sampling weights can be chosen to minimize the estimation variance of counts of specified sets of subgraphs. Unlike many prior graph sampling schemes, GPS separates the functions of edge sampling and subgraph estimation. We propose two estimation frameworks: (1) Post-Stream estimation, which allows GPS to construct a reference sample of edges to support retrospective graph queries, and (2) In-Stream estimation, which allows GPS to obtain lower-variance estimates by incrementally updating the subgraph count estimates during stream processing. Unbiasedness of the subgraph estimators is established through a new martingale formulation of graph stream order sampling, in which subgraph estimators, written as a product of constituent edge estimators, are unbiased even when computed at different points in the stream. The separation of estimation and sampling enables significant resource savings relative to previous work. We illustrate our framework with applications to triangle and wedge counting. We perform a large-scale experimental study on real-world graphs of various domains and types. GPS achieves high accuracy, with < 1% error for triangle and wedge counting, while storing a small fraction of the graph and with average update times of a few microseconds per edge. Notably, for billion-scale graphs, GPS estimates triangle and wedge counts with < 1% error while storing less than 0.01% of the total edges in the graph.
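The abstract describes order-based reservoir sampling of edges followed by Post-Stream estimation as a product of per-edge estimators. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes a stream of distinct undirected edges with comparable node labels, a standard priority-sampling rule (priority = weight / uniform variate, keep the `m` largest, threshold = largest discarded priority), and a placeholder uniform weight. The names `gps_sample`, `estimate_triangles`, and the `weight(edge, adj)` callback are hypothetical and introduced only for illustration.

```python
import heapq
import random


def gps_sample(edges, m, weight=lambda edge, adj: 1.0, seed=0):
    """Keep the m highest-priority edges of a stream.

    Returns {edge: estimated inclusion probability}. The weight callback
    is an assumption; by default every edge has weight 1.
    """
    rng = random.Random(seed)
    heap = []      # min-heap of (priority, edge, weight)
    adj = {}       # adjacency of the edges currently in the sample
    z_star = 0.0   # largest priority discarded so far (the threshold)
    for u, v in edges:
        e = (min(u, v), max(u, v))
        w = weight(e, adj)               # weight may depend on the sample
        r = w / (1.0 - rng.random())     # uniform variate lies in (0, 1]
        heapq.heappush(heap, (r, e, w))
        adj.setdefault(e[0], set()).add(e[1])
        adj.setdefault(e[1], set()).add(e[0])
        if len(heap) > m:                # evict the lowest-priority edge
            r_min, e_min, _ = heapq.heappop(heap)
            z_star = max(z_star, r_min)
            adj[e_min[0]].discard(e_min[1])
            adj[e_min[1]].discard(e_min[0])
    if z_star == 0.0:                    # nothing discarded: sample is exact
        return {e: 1.0 for _, e, _ in heap}
    return {e: min(1.0, w / z_star) for _, e, w in heap}


def estimate_triangles(probs):
    """Post-stream estimate: sum over sampled triangles of 1 / (p1*p2*p3)."""
    adj = {}
    for a, b in probs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    total = 0.0
    for a, b in probs:                   # edges are stored with a < b
        for c in adj[a] & adj[b]:
            if c > b:                    # count each triangle a < b < c once
                total += 1.0 / (probs[(a, b)] * probs[(a, c)] * probs[(b, c)])
    return total


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(estimate_triangles(gps_sample(k4, m=4)))
```

Because sampling and estimation are separate functions here, the same sample can be reused for other retrospective queries, and a weight callback that favors edges closing many sampled triangles could be swapped in without touching the estimator.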


Published in: Proceedings of the VLDB Endowment, Volume 10, Issue 11, August 2017, 432 pages. ISSN: 2150-8097.

Publisher: VLDB Endowment
