DOI: 10.1145/1807167.1807278
research-article

Data warehousing and analytics infrastructure at facebook

Published: 06 June 2010

Abstract

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.

    Published In

    SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
    June 2010
    1286 pages
    ISBN:9781450300322
    DOI:10.1145/1807167

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. analytics
    2. data discovery
    3. data warehouse
    4. distributed file system
    5. distributed systems
    6. facebook
    7. hadoop
    8. hive
    9. log aggregation
    10. map-reduce
    11. resource sharing
    12. scalability
    13. scribe

    Conference

    SIGMOD/PODS '10: International Conference on Management of Data
    June 6 - 10, 2010
    Indianapolis, Indiana, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
