research-article

Data warehousing and analytics infrastructure at facebook

Authors:

Hao Liu

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 1013 - 1020

https://doi.org/10.1145/1807167.1807278

Published: 06 June 2010

Abstract

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored, and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop, and Hive, which together form the cornerstones of the log collection, storage, and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.
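
The storage figures quoted in the abstract imply a compression ratio that can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes decimal units (1 PB = 1000 TB), treats the "more than" figures as exact, and the variable names and the final "days of loading" estimate are hypothetical additions, not numbers stated in the paper.

```python
# Figures quoted in the abstract, treated as exact for illustration.
# Assumption: decimal units, 1 PB = 1000 TB.
WAREHOUSE_RAW_PB = 15.0
WAREHOUSE_COMPRESSED_PB = 2.5
DAILY_RAW_TB = 60.0
DAILY_COMPRESSED_TB = 10.0


def compression_ratio(raw: float, compressed: float) -> float:
    """Return the raw size divided by the compressed size."""
    return raw / compressed


warehouse_ratio = compression_ratio(WAREHOUSE_RAW_PB, WAREHOUSE_COMPRESSED_PB)
daily_ratio = compression_ratio(DAILY_RAW_TB, DAILY_COMPRESSED_TB)

# Hypothetical estimate (not from the paper): number of days of loading at
# the quoted daily rate that would add up to the quoted compressed total.
days_at_quoted_rate = WAREHOUSE_COMPRESSED_PB * 1000 / DAILY_COMPRESSED_TB

print(f"warehouse compression ratio: {warehouse_ratio:.1f}x")
print(f"daily load compression ratio: {daily_ratio:.1f}x")
print(f"days of loading at the quoted rate: {days_at_quoted_rate:.0f}")
```

Under these assumptions both the stored data and the daily load work out to the same ratio of roughly 6 to 1, which suggests the two pairs of figures in the abstract are consistent with each other.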

Index Terms

Data warehousing and analytics infrastructure at facebook
1. Information systems

Information & Contributors

Information

Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010

1286 pages

ISBN:9781450300322

DOI:10.1145/1807167

General Chair: Ahmed Elmagarmid, Purdue University, USA

Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '10: International Conference on Management of Data

Sponsor: SIGMOD

June 6 - 10, 2010

Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Article Metrics

Total Citations: 273
Total Downloads: 6,547
Downloads (Last 12 months): 101
Downloads (Last 6 weeks): 8

Reflects downloads up to 09 Feb 2025

Citations

Cited By

Shailin Saraiya (2025). Technical Evolution and Performance Analysis of MapReduce in Modern Distributed Systems. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 11:1 (29-35). Online publication date: 3-Jan-2025.
https://doi.org/10.32628/CSEIT25111206
Rhim A, Chiu K, Chiu D, Ho K (2024). Transforming Online Travel Agencies Using an Alert Management System. Adapting to Evolving Consumer Experiences in Hospitality and Tourism (31-60). Online publication date: 15-Nov-2024.
https://doi.org/10.4018/979-8-3693-7021-6.ch002
Su N, Huang S, Su C (2024). Elevating Smart Manufacturing with a Unified Predictive Maintenance Platform: The Synergy between Data Warehousing, Apache Spark, and Machine Learning. Sensors 24:13 (4237). Online publication date: 29-Jun-2024.
https://doi.org/10.3390/s24134237
Cuzzocrea A (2024). Privacy-preserving OLAP against big query workloads: innovative theories and theorems. Distributed and Parallel Databases 43:1. Online publication date: 1-Nov-2024.
https://doi.org/10.1007/s10619-024-07445-5
Cuzzocrea A (2023). Privacy-Preserving OLAP via Modeling and Analysis of Query Workloads: Innovative Theories and Theorems. Proceedings of the 35th International Conference on Scientific and Statistical Database Management (1-12). Online publication date: 10-Jul-2023.
https://dl.acm.org/doi/10.1145/3603719.3603735
Sahraei A, Demetriou S, Sobhgol A, Zhang H, Nagaraja A, Pathak N, Joshi G, Souza C, Huang B, Cook W, Golovei A, Venkat P, Mcfague A, Skarlatos D, Patel V, Thind R, Gonzalez E, Jin Y, Tang C, Druschel P, Kaufmann A, Mace J, Flinn J, Seltzer M (2023). XFaaS: Hyperscale and Low Cost Serverless Functions at Meta. Proceedings of the 29th Symposium on Operating Systems Principles (231-246). Online publication date: 23-Oct-2023.
https://dl.acm.org/doi/10.1145/3600006.3613155
Patil T, Anand K, Bhateja A, Jamal K, Sawant-Patil S, Paygude P (2023). Real-Time Clickstream Data Processing and Visualization Using Apache Tools. 2023 7th International Conference On Computing, Communication, Control And Automation (ICCUBEA) (1-5). Online publication date: 18-Aug-2023.
https://doi.org/10.1109/ICCUBEA58933.2023.10392270
Ansari Z, Parwez M, Thoker I, Jahiruddin (2023). Enhanced subgraph matching for large graphs using candidate region-based decomposition and ordering. Journal of King Saud University - Computer and Information Sciences 35:8 (101694). Online publication date: Sep-2023.
https://doi.org/10.1016/j.jksuci.2023.101694
Liu Y, Ou Y, Chen W, Chen Z, Xiao N (2023). LazySort: A customized sorting algorithm for non-volatile memory. Information Sciences 641 (119137). Online publication date: Sep-2023.
https://doi.org/10.1016/j.ins.2023.119137
Tomita A (2022). A Framework to Manage a Penetration of Digital Systems into Physical Society. 2022 Portland International Conference on Management of Engineering and Technology (PICMET) (1-7). Online publication date: Aug-2022.
https://doi.org/10.23919/PICMET53225.2022.9882756
