DOI: 10.5555/2482626.2482657
Article

Stronger semantics for low-latency geo-replicated storage

Published: 02 April 2013

Abstract

We present the first scalable, geo-replicated storage system that guarantees low latency, offers a rich data model, and provides "stronger" semantics. Namely, all client requests are satisfied in the local datacenter in which they arise; the system efficiently supports useful data model abstractions such as column families and counter columns; and clients can access data in a causally-consistent fashion with read-only and write-only transactional support, even for keys spread across many servers.
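
As a rough illustration (the paper itself includes no code), the following minimal Python sketch shows what client-side causal-dependency tracking can look like in a COPS/Eiger-style store: the client remembers the newest version it has observed for each key and attaches those dependencies to its writes. The LocalStore class, its local_get/local_put interface, the integer version counter, and the dependency format are all hypothetical stand-ins, not Eiger's actual API or protocol.

# Minimal sketch, not Eiger's actual protocol: a client wrapper that tracks
# causal dependencies as the latest version it has observed for each key and
# attaches them to writes, so remote datacenters could delay applying a write
# until everything it depends on is already visible there.

class LocalStore:
    """In-memory stand-in for the datacenter-local replica (hypothetical)."""
    def __init__(self):
        self.data = {}                          # key -> (version, value)
        self.clock = 0

    def local_get(self, key):
        return self.data.get(key, (None, None))

    def local_put(self, key, value, dependencies):
        self.clock += 1
        self.data[key] = (self.clock, value)    # dependencies would be shipped
        return self.clock                       # with the write during replication


class CausalClient:
    def __init__(self, store):
        self.store = store
        self.deps = {}                          # key -> newest observed version

    def get(self, key):
        version, value = self.store.local_get(key)
        if version is not None:
            self.deps[key] = max(self.deps.get(key, 0), version)
        return value

    def put(self, key, value):
        version = self.store.local_put(key, value, dependencies=dict(self.deps))
        self.deps = {key: version}              # the new write subsumes its dependencies
        return version

In a system of this style, a remote datacenter applies a replicated write only after the write's dependencies are visible locally; keeping that dependency metadata small is one of the hard problems the paper addresses.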
The primary contributions of this work are enabling scalable causal consistency for the complex column-family data model, as well as novel, non-blocking algorithms for both read-only and write-only transactions. Our evaluation shows that our system, Eiger, achieves low latency (single-digit milliseconds), has throughput competitive with eventually-consistent and non-transactional Cassandra (less than 7% overhead for one of Facebook's real-world workloads), and scales out to large clusters almost linearly (averaging 96% throughput increases per doubling of servers, up to 128-server clusters).
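
To give a flavor of what a non-blocking read-only transaction can look like over a multiversioned store, here is a simplified Python sketch. It is not Eiger's algorithm (the paper's algorithms operate over its column-family data model and are more involved); the toy VersionedStore, the caller-supplied integer timestamps, and the two-round structure are illustrative assumptions only.

# Simplified sketch, not Eiger's actual algorithm: a read-only "transaction"
# over a toy multiversioned store. Round one reads the newest version of each
# key; if the versions do not line up, round two re-reads at a single
# effective timestamp so the caller sees one consistent snapshot. Neither
# round takes locks or blocks writers.

class VersionedStore:
    def __init__(self):
        self.versions = {}                              # key -> [(timestamp, value), ...]

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read_latest(self, key):
        vs = self.versions.get(key)
        return max(vs, key=lambda tv: tv[0]) if vs else (0, None)

    def read_at(self, key, ts):
        vs = [tv for tv in self.versions.get(key, []) if tv[0] <= ts]
        return max(vs, key=lambda tv: tv[0]) if vs else (0, None)


def read_only_txn(store, keys):
    # Round 1: optimistic reads of the newest versions.
    first = {k: store.read_latest(k) for k in keys}
    effective_ts = max(ts for ts, _ in first.values())
    # Round 2 (only for keys that need it): re-read at the chosen timestamp.
    # In a distributed setting this second round is what absorbs writes that
    # land on other servers between the two rounds.
    return {k: (val if ts == effective_ts else store.read_at(k, effective_ts)[1])
            for k, (ts, val) in first.items()}

For example, read_only_txn(store, ["alice:wall", "bob:wall"]) returns values that could all have been current at the same effective timestamp, even though the keys may live on different servers.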

Published In

NSDI '13: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation
April 2013
546 pages

Sponsors

  • VMware
  • Akamai
  • Google Inc.
  • NSF
  • Facebook

Publisher

USENIX Association

United States

Publication History

Published: 02 April 2013

Qualifiers

  • Article

Contributors

  • Wyatt Lloyd
  • Michael J. Freedman
  • Michael Kaminsky
  • David G. Andersen
