DOI: 10.5555/2482626.2482657
Article

Stronger semantics for low-latency geo-replicated storage

Published: 02 April 2013

Abstract

We present the first scalable, geo-replicated storage system that guarantees low latency, offers a rich data model, and provides "stronger" semantics. Namely, all client requests are satisfied in the local datacenter in which they arise; the system efficiently supports useful data model abstractions such as column families and counter columns; and clients can access data in a causally-consistent fashion with read-only and write-only transactional support, even for keys spread across many servers.
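
As a rough illustration (the paper itself includes no code), the following minimal Python sketch shows what client-side causal-dependency tracking can look like in a COPS/Eiger-style store: the client remembers the newest version it has observed for each key and attaches those dependencies to its writes. The LocalStore class, its local_get/local_put interface, the integer version counter, and the dependency format are all hypothetical stand-ins, not Eiger's actual API or protocol.

# Minimal sketch, not Eiger's actual protocol: a client wrapper that tracks
# causal dependencies as the latest version it has observed for each key and
# attaches them to writes, so remote datacenters could delay applying a write
# until everything it depends on is already visible there.

class LocalStore:
    """In-memory stand-in for the datacenter-local replica (hypothetical)."""
    def __init__(self):
        self.data = {}                          # key -> (version, value)
        self.clock = 0

    def local_get(self, key):
        return self.data.get(key, (None, None))

    def local_put(self, key, value, dependencies):
        self.clock += 1
        self.data[key] = (self.clock, value)    # dependencies would be shipped
        return self.clock                       # with the write during replication


class CausalClient:
    def __init__(self, store):
        self.store = store
        self.deps = {}                          # key -> newest observed version

    def get(self, key):
        version, value = self.store.local_get(key)
        if version is not None:
            self.deps[key] = max(self.deps.get(key, 0), version)
        return value

    def put(self, key, value):
        version = self.store.local_put(key, value, dependencies=dict(self.deps))
        self.deps = {key: version}              # the new write subsumes its dependencies
        return version

In a system of this style, a remote datacenter applies a replicated write only after the write's dependencies are visible locally; keeping that dependency metadata small is one of the hard problems the paper addresses.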
The primary contributions of this work are enabling scalable causal consistency for the complex column-family data model, as well as novel, non-blocking algorithms for both read-only and write-only transactions. Our evaluation shows that our system, Eiger, achieves low latency (single-digit milliseconds), has throughput competitive with eventually-consistent and non-transactional Cassandra (less than 7% overhead for one of Facebook's real-world workloads), and scales out to large clusters almost linearly (averaging 96% throughput increases per doubling of servers, up to 128-server clusters).
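
To give a flavor of what a non-blocking read-only transaction can look like over a multiversioned store, here is a simplified Python sketch. It is not Eiger's algorithm (the paper's algorithms operate over its column-family data model and are more involved); the toy VersionedStore, the caller-supplied integer timestamps, and the two-round structure are illustrative assumptions only.

# Simplified sketch, not Eiger's actual algorithm: a read-only "transaction"
# over a toy multiversioned store. Round one reads the newest version of each
# key; if the versions do not line up, round two re-reads at a single
# effective timestamp so the caller sees one consistent snapshot. Neither
# round takes locks or blocks writers.

class VersionedStore:
    def __init__(self):
        self.versions = {}                              # key -> [(timestamp, value), ...]

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read_latest(self, key):
        vs = self.versions.get(key)
        return max(vs, key=lambda tv: tv[0]) if vs else (0, None)

    def read_at(self, key, ts):
        vs = [tv for tv in self.versions.get(key, []) if tv[0] <= ts]
        return max(vs, key=lambda tv: tv[0]) if vs else (0, None)


def read_only_txn(store, keys):
    # Round 1: optimistic reads of the newest versions.
    first = {k: store.read_latest(k) for k in keys}
    effective_ts = max(ts for ts, _ in first.values())
    # Round 2 (only for keys that need it): re-read at the chosen timestamp.
    # In a distributed setting this second round is what absorbs writes that
    # land on other servers between the two rounds.
    return {k: (val if ts == effective_ts else store.read_at(k, effective_ts)[1])
            for k, (ts, val) in first.items()}

For example, read_only_txn(store, ["alice:wall", "bob:wall"]) returns values that could all have been current at the same effective timestamp, even though the keys may live on different servers.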

Published In

NSDI '13: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation
April 2013
546 pages

Sponsors

  • VMware
  • Akamai
  • Google Inc.
  • NSF
  • Facebook

Publisher

USENIX Association

United States

Publication History

Published: 02 April 2013

Qualifiers

  • Article

Contributors

  • Wyatt Lloyd
  • Michael J. Freedman
  • Michael Kaminsky
  • David G. Andersen
