skip to main content
10.1145/1559845.1559893acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

E = MC3: managing uncertain enterprise data in a cluster-computing environment

Published: 29 June 2009 Publication History

Abstract

Modern enterprises must manage uncertain data for purposes of risk assessment and decisionmaking under uncertainty. The Monte Carlo approach embodied in the MCDB system of Jampani et al. is well suited for such a task. MCDB can support industrial strength business-intelligence queries over uncertain warehouse data. Moreover, MCDB's extensible approach to specifying uncertainty can also capture complex stochastic prediction models, allowing sophisticated ``what-if'' analyses within the DBMS. The MCDB computations can be highly CPU intensive, but offer the potential for massive parallelization. To realize this potential, we provide a new system, called MC3 (Monte Carlo Computation on a Cluster), that extends the MCDB approach to the map-reduce processing framework. MC3 can exploit the robustness and scalability of map-reduce, and can handle data stored in non-relational formats. We show how MCDB query plans over ``tuple bundles'' can be translated to sequences of map-reduce operations over nested data, and describe different parallelization schemes. We also provide and analyze several novel distributed algorithms for adding pseudorandom number seeds to tuple bundles. These algorithms ensure statistical correctness of the Monte-Carlo computations while minimizing the seed length. Our experiments show that MC3 can scale well for a variety of workloads.

References

[1]
P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio:A system for data, uncertainty, and lineage. In VLDB, 2006.
[2]
L. Antova, C. Koch, andD. Olteanu. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In ICDE, pages 1479--1480, 2007.
[3]
J. Boulos, N. N. Dalvi, B. Mandhani, S. Mathur, C. Ré, and D. Suciu. MYSTIQ:a system for finding more answers by using probabilities. In ACM SIGMOD, pages 891--893, 2005.
[4]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable:a distributed storage system for structured data. In OSDI, pages 15--15, 2006.
[5]
H. chih Yang, A. Dasdan, R. -L. Hsiao, and D. S. Parker. Map-reduce-merge:simpli fied relational data processing on large clusters. In SIGMOD, pages 1029--1040, 2007.
[6]
P. D. Coddington. Random number generators for parallel computers. The NHSE Review, 2, 1996.
[7]
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts:Yahoo!'s hosted data serving platform. Proc. VLDB, pages 1277--1288, 2008.
[8]
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004.
[9]
L. Devroye. Non-Uniform Random Variate Generation. Springer, 1986.
[10]
D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85--98, 1992.
[11]
P. W. Glynn and S. Asmussen. Stochastic Simulation: Algorithms and Analysis. Springer, 2007.
[12]
Hadoop. http://hadoop.apache.org/core/.
[13]
H. Haramoto, M. Matsumoto, and P. L'Ecuyer. A fast jump ahead algorithm for linear recurrences in a polynomial space. In SETA, pages 290--298, 2008.
[14]
H. Haramoto, M. Matsumoto, T. Nishimura, F. Panneton, and P. L'Ecuyer. Efficient jump ahead for F 2 linear random number generators. INFORMS J. Computing, 20(3):385--390, 2008.
[15]
S. G. Henderson and B. L. Nelson, editors. Simulation. North-Holland, 2006.
[16]
R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and P. J. Haas. MCDB:a Monte Carlo approach to managing uncertain data. In ACM SIGMOD, pages 687--700, 2008.
[17]
JAQL. http://code.google.com/p/jaql/.
[18]
JSON. http://www.json.org.
[19]
B. Kimelfeld and Y. Sagiv. Modeling and querying probabilistic XML data. SIGMOD Record, 37(4):69--77, 2008.
[20]
P. L'Ecuyer. Random numbers for simulation. Comm. ACM, 33(10):85--97, 1990.
[21]
P. L'Ecuyer. Good parameters and implementations for combined multiple recursive random number generators. Oper. Res., 47(1):159--164, 1999.
[22]
P. L'Ecuyer and T. H. Andres. A random number generator based on the combination of four LCGs. Math. Comput. Simul., 44(1):99--107, 1997.
[23]
M. Mascagni. Some methods of parallel pseudorandom number generation. In R. Schreiber, M. Heath, and A. Ranade, editors, Algorithms for Parallel Processing, pages 277--288. Springer, 1997.
[24]
F. Panneton, P. L'Ecuyer, and M. Matsumoto. Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Software, 32(1):1--16, 2006.
[25]
S. K. Park and K. W. Miller. Random number generators:Good ones are hard to find. Comm. ACM, 31(10):1192--1201, 1988.
[26]
C. Re and D. Suciu. Managing probabilistic data with MystiQ:The can-do, the could-do, and the can't-do. In SUM, pages 5--18, 2008.
[27]
SimpleDB. http://aws.amazon.com.
[28]
S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. Hambrusch, and R. Shah. Orion 2. 0:native support for uncertain data. In ACM SIGMOD, pages 1239--1242, 2006.
[29]
SQLServer Data Services. http://www.microsoft.com/sql/dataservices/default.mspx.
[30]
A. Srinivasan, D. M. Ceperley, and M. Mascagni. Random number generators for parallel applications. In Monte Carlo Methods in Chemical Physics, pages 13--36. Wiley, 1997.
[31]
C. J. K. Tan. The PLFG parallel pseudo-random number generator. Future Generation Computer Systems, 18:693--698, 2002.
[32]
D. Z. Wang, E. Michelakis, M. N. Garofalakis, and J. M. Hellerstein. BayesStore:managing large, uncertain data repositories with probabilistic graphical models. Proc. VLDB, pages 340--351, 2008.

Cited By

View all

Index Terms

  1. E = MC3: managing uncertain enterprise data in a cluster-computing environment

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
    June 2009
    1168 pages
    ISBN:9781605585512
    DOI:10.1145/1559845
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 June 2009

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. jaql
    2. json
    3. map-reduce
    4. monte carlo
    5. uncertain data

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '09
    Sponsor:
    SIGMOD/PODS '09: International Conference on Management of Data
    June 29 - July 2, 2009
    Rhode Island, Providence, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)6
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 24 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media