The Snowflake Elastic Data Warehouse

Authors: Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Spyridon Triantafyllis, Philipp Unterbrunner

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 215 - 226

https://doi.org/10.1145/2882903.2903741

Published: 14 June 2016

Abstract

We live in the golden age of distributed computing. Public cloud platforms now offer virtually unlimited compute and storage resources on demand. At the same time, the Software-as-a-Service (SaaS) model brings enterprise-class systems to users who previously could not afford such systems due to their cost and complexity. Alas, traditional data warehousing systems are struggling to fit into this new environment. For one thing, they have been designed for fixed resources and are thus unable to leverage the cloud's elasticity. For another thing, their dependence on complex ETL pipelines and physical tuning is at odds with the flexibility and freshness requirements of the cloud's new types of semi-structured data and rapidly evolving workloads. We decided a fundamental redesign was in order. Our mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or "Snowflake" for short. Snowflake is a multi-tenant, transactional, secure, highly scalable and elastic system with full SQL support and built-in extensions for semi-structured and schema-less data. The system is offered as a pay-as-you-go service in the Amazon cloud. Users upload their data to the cloud and can immediately manage and query it using familiar tools and interfaces. Implementation began in late 2012 and Snowflake has been generally available since June 2015. Today, Snowflake is used in production by a growing number of small and large organizations alike. The system runs several million queries per day over multiple petabytes of data.

In this paper, we describe the design of Snowflake and its novel multi-cluster, shared-data architecture. The paper highlights some of the key features of Snowflake: extreme elasticity and availability, semi-structured and schema-less data, time travel, and end-to-end security. It concludes with lessons learned and an outlook on ongoing work.

Index Terms

The Snowflake Elastic Data Warehouse
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Online analytical processing engines
      2. Parallel and distributed DBMSs
        1. Relational parallel and distributed DBMSs
2. Networks
  1. Network services
    1. Cloud computing

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN:9781450335317

DOI:10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
