DOI:10.1145/2723372.2735378
research-article

Just can't get enough: Synthesizing Big Data

Published: 27 May 2015

Abstract

With rapidly decreasing prices for storage and storage systems, ever larger data sets become economical. While only a few years ago only successful transactions were recorded in sales systems, today every user interaction is stored for ever deeper analysis and richer user modeling. This has led to the development of big data systems, which offer high scalability and novel forms of analysis. Due to the rapid development and ever-increasing variety of the big data landscape, there is a pressing need for tools for testing and benchmarking.
Vendors have few options to showcase the performance of their systems other than trivial data sets like TeraSort or WordCount. Since customers' real data is typically subject to privacy regulations and can rarely be utilized, simplistic proofs of concept have to be used, leaving both customers and vendors unclear about the target use-case performance. As a solution, we present an automatic approach to data synthesis from existing data sources. Our system enables fully automatic generation of large amounts of complex, realistic, synthetic data.

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data generator
  2. dbsynth
  3. pdgf

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
